Abstract

Compressed sensing deals with efficient recovery of analog signals from linear encodings. 
This paper presents a statistical study of compressed sensing by modeling the input signal as 
an i.i.d. process with known distribution. Three classes of encoders are considered, namely 
optimal nonlinear, optimal linear and random linear encoders. Focusing on optimal decoders, 
we investigate the fundamental tradeoff between measurement rate and reconstruction fidelity 
gauged by error probability and noise sensitivity in the absence and presence of measurement 
noise, respectively. The optimal phase transition threshold is determined as a functional of the 
input distribution and compared to suboptimal thresholds achieved by popular reconstruction 
algorithms. In particular, we show that Gaussian sensing matrices incur no penalty on the 
phase transition threshold with respect to optimal nonlinear encoding. Our results also provide 
a rigorous justification of previous results based on replica heuristics in the weak-noise regime. 

Keywords: Compressed sensing, Shannon theory, phase transition, Rényi information dimension,
MMSE dimension, random matrix, joint source-channel coding.

1 Introduction 
1.1 Setup 

Compressed sensing [3, 4] is a signal processing technique that compresses analog vectors by 
means of a linear transformation. By leveraging prior knowledge of the signal structure (e.g., 
sparsity) and by designing efficient nonlinear reconstruction algorithms, effective compression is 
achieved by taking a much smaller number of measurements than the dimension of the original 
signal. 

An abstract setup of compressed sensing is shown in Fig. 1: A real vector $x^n \in \mathbb{R}^n$ is mapped into
$y^k \in \mathbb{R}^k$ by an encoder (or compressor) $f: \mathbb{R}^n \to \mathbb{R}^k$. The decoder (or decompressor)
$g: \mathbb{R}^k \to \mathbb{R}^n$ receives $\hat{y}^k$, a possibly noisy version of the measurement, and outputs $\hat{x}^n$ as the
reconstruction. The measurement rate, i.e., the dimensionality compression ratio, is given by¹

$$R = \frac{k}{n}. \quad (1)$$


Most of the compressed sensing literature focuses on the setup where

a) performance is measured on a worst-case basis with respect to $x^n$.

b) the encoder is constrained to be a linear mapping characterized by a $k \times n$ matrix $A$, called
the sensing or measurement matrix, which is usually assumed to be random, and known at the
decoder.



c) the decoder is a low-complexity algorithm which is robust with respect to observation noise, for
example, decoders based on convex optimizations such as $\ell_1$-minimization [6] and $\ell_1$-penalized
least-squares (i.e., LASSO) [7], greedy algorithms such as matching pursuit [8], graph-based
iterative decoders such as approximate message passing (AMP) [9], fast iterative
shrinkage-thresholding algorithm (FISTA) [10], etc.

In contrast, in this paper we formulate an information-theoretic fundamental limit in the fol- 
lowing setup: 

a) the input vector $x^n$ is random with a known distribution and performance is measured on an
average basis.²

b) in addition to the performance that can be achieved by the optimal sensing matrix, we also
investigate the optimal performance that can be achieved by any nonlinear encoder.

c) the decoder is optimal:³

• In the noiseless case, it is required to be Lipschitz continuous for the sake of robustness; 

• In the noisy case, it is the minimum mean-square error (MMSE) estimator, i.e., the condi- 
tional expectation of the input vector given the noisy measurements. 

Due to the constraints of actual measuring devices in certain applications of compressed sensing 
(e.g., MRI [23], high-resolution radar imaging [24]), one does not have the freedom to optimize 
over all possible sensing matrices. Therefore we consider both optimized as well as random sensing 
matrices and investigate their respective fundamental limits achieved by the corresponding optimal 
decoders. 

¹Alternative notations have been used to denote the signal dimension and the number of measurements, e.g.,
$(m,n)$ in [4] and $(N,K)$ in [5].

²Similar Bayesian modeling is followed in some of the compressed sensing literature, for example, [11, 9, 12, 13,
14, 15, 16, 17, 18, 19].

³The performance of optimal decoders for support recovery in the noisy case has been studied in [20, 21, 22] on a
worst-case basis.




Figure 1: Compressed sensing: an abstract setup.

1.2 Phase transition



The general goal is to investigate the fundamental tradeoff between reconstruction fidelity and
measurement rate as $n \to \infty$, as a functional of the signal and noise statistics.

When the measurements are noiseless, the goal is to reconstruct the original signal as perfectly 
as possible by driving the error probability to zero as the ambient dimension, n, grows. For many 
input processes, e.g., independent and identically distributed (i.i.d.) ones, it turns out that there 
exists a threshold for the measurement rate, above which it is possible to achieve a vanishing error 
probability and below which the error probability will eventually approach one for any sequence of 
encoder-decoder pairs. Such a phenomenon is known as phase transition in statistical physics. In 
information-theoretic parlance, we say that the strong converse holds. 

When the measurement is noisy, exact analog signal recovery is obviously impossible and we 
gauge the reconstruction fidelity by the noise sensitivity, defined as the ratio between the mean- 
square reconstruction error and the noise variance. Similar to the behavior of error probability 
in the noiseless case, there exists a phase transition threshold of measurement rate, which only 
depends on the input statistics, above which the noise sensitivity is bounded for all noise variances, 
and below which the noise sensitivity blows up as the noise variance tends to zero. 

1.3 Signal model 

Sparse vectors, supported on a subspace with dimension smaller than $n$, play an important
role in signal processing and statistical models. A stochastic model that captures sparsity is the
following mixture distribution [12, 25, 9, 14, 17, 18, 19]:

$$P = (1-\gamma)\delta_0 + \gamma P_c, \quad (2)$$

where $\delta_0$ denotes the Dirac measure at 0, $P_c$ is a probability measure absolutely continuous with
respect to the Lebesgue measure, and $0 \le \gamma \le 1$. Consider a random vector $X^n$ independently
drawn from $P$. By the weak law of large numbers, $\frac{1}{n}\|X^n\|_0 \xrightarrow{P} \gamma$, where the "$\ell_0$ norm" $\|\cdot\|_0$ denotes
the number of non-zeros of a vector. This corresponds to the regime of proportional (or linear)
sparsity. In (2), the weight on the continuous part $\gamma$ parametrizes the signal sparsity and $P_c$ serves
as the prior distribution of non-zero entries.
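
The law-of-large-numbers statement above can be checked numerically. The following is a minimal sketch, assuming for illustration only that $P_c$ is the standard Gaussian distribution; the helper names `sample_sparse_vector` and `l0_norm` are hypothetical and not part of the paper.

```python
import random

def sample_sparse_vector(n, gamma, rng):
    """Draw n i.i.d. entries from P = (1 - gamma) * delta_0 + gamma * P_c.

    P_c is taken to be N(0, 1) purely for illustration (an assumption).
    """
    return [rng.gauss(0.0, 1.0) if rng.random() < gamma else 0.0 for _ in range(n)]

def l0_norm(x):
    """Number of non-zero entries of x (the "l0 norm")."""
    return sum(1 for v in x if v != 0.0)

rng = random.Random(0)
gamma = 0.1
for n in (100, 10_000, 100_000):
    x = sample_sparse_vector(n, gamma, rng)
    # The fraction of non-zeros should approach gamma as n grows.
    print(n, l0_norm(x) / n)
```

The printed fractions should settle near $\gamma$ as $n$ grows, which is the proportional sparsity regime described in the text.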

Generalizing (2), we henceforth consider discrete-continuous mixed distributions (i.e., elementary
distributions [26]):

$$P_X = (1-\gamma)P_d + \gamma P_c, \quad (3)$$

where $P_d$ is a discrete probability measure and $P_c$ is an absolutely continuous probability measure.
For simplicity we focus on i.i.d. input processes in this paper. Note that apart from sparsity, there 
are other signal structures that have been previously explored in the compressed sensing literature. 
For example, the so-called simple signal in infrared absorption spectroscopy [27, Example 3, p. 
914] is such that each entry of the signal vector is constrained to lie in the unit interval, with 
most of the entries saturated at the boundaries (0 or 1). Similar to the rationale that leads to (2), 
an appropriate statistical model for simple signals is a mixture of a Bernoulli distribution and an 
absolutely continuous distribution supported on the unit interval, which is a particular instance of 
(3). Although most of the results in the present paper hold for arbitrary input distributions, with 
no practical loss of generality, we will be focusing on discrete-continuous mixtures (i.e., without 
singular components) because of their relevance to compressed sensing applications. 

1.4 Main contributions



We introduced the framework of almost lossless analog compression in [12] as a Shannon-theoretic
formulation of noiseless compressed sensing. Under regularity conditions on the encoder
or the decoder, [12] derives various coding theorems for the minimal measurement rate involving
the information dimension of the input distribution, introduced by Alfréd Rényi in 1959 [28]. Along
with the Minkowski and MMSE dimension, we summarize a few relevant properties of Rényi information
dimension in Section 2. The most interesting regularity constraints are the linearity of the
compressor and Lipschitz continuity (robustness) of the decompressor, which are considered separately
in [12]. Section 3 gives a brief summary of the non-asymptotic version of these results. In
addition, in this paper we also consider the fundamental limit when linearity and Lipschitz continuity
are imposed simultaneously. For i.i.d. discrete-continuous mixtures, we show that the minimal
measurement rate is given by the input information dimension, i.e., the weight $\gamma$ of the absolutely
continuous part. Moreover, the Lipschitz constant of the decoder can be chosen independently of
$n$, as a function of the gap between the measurement rate and $\gamma$. This results in the optimal phase
transition threshold of error probability in noiseless compressed sensing.

Our main results are presented in Section 4, which deals with the case where the measurements 
are corrupted by additive Gaussian noise. We consider three formulations of noise sensitivity: 
optimal nonlinear, optimal linear and random linear (with i.i.d. entries) encoder and the associated 
optimal decoder. In the case of i.i.d. input processes, we show that for any input distribution, 
the phase transition threshold for optimal encoding is given by the input information dimension. 
Moreover, this result also holds for discrete-continuous mixtures with optimal linear encoders and 
Gaussian random measurement matrices. Invoking the results in [29], we show that the calculation 
of the reconstruction error with random measurement matrices based on heuristic replica methods in 
[14] predicts the correct phase transition threshold. These results also serve as a rigorous verification
of the replica calculations in [14] in the high-SNR regime (up to $o(\sigma^2)$ as the noise variance $\sigma^2$
vanishes).

The fact that randomly chosen sensing matrices turn out to incur no penalty in phase transition 
threshold with respect to optimal nonlinear encoders lends further importance to the conventional 
compressed sensing setup described in Section 1.1. 

In Section 5, we compare the optimal phase transition threshold to the suboptimal threshold
of several practical reconstruction algorithms under various input distributions. In particular, we
demonstrate that the thresholds achieved by the $\ell_1$-minimization decoder and the AMP decoder
[25, 13] lie far from the optimal boundary, especially in the highly sparse regime which is most
relevant to compressed sensing applications.

2 Three dimensions 

In this section we introduce three dimension concepts for sets and probability measures involved 
in various coding theorems in Sections 3 and 4. 

2.1 Information dimension 

In [28], Rényi defined the information dimension (also known as the entropy dimension [30]) of a
probability distribution, a key concept in fractal geometry. It measures the rate of growth of the
entropy of successively finer discretizations.

Definition 1. Let $X$ be a real-valued random variable. Let $m \in \mathbb{N}$. The information dimension
of $X$ is defined as

$$d(X) = \lim_{m \to \infty} \frac{H(\lfloor mX \rfloor)}{\log m}. \quad (4)$$

If the limit in (4) does not exist, the liminf and limsup are called lower and upper information
dimensions of $X$ respectively, denoted by $\underline{d}(X)$ and $\overline{d}(X)$.

Definition 1 can be readily extended to random vectors, where the floor function $\lfloor \cdot \rfloor$ is taken
componentwise. Since $d(X)$ only depends on the distribution of $X$, we also denote $d(P_X) = d(X)$.
The same convention also applies to other information measures.
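
To make Definition 1 concrete, the sketch below evaluates the ratio $H(\lfloor mX \rfloor)/\log m$ in closed form for one simple input. It assumes, purely as an illustration, that $X$ is zero with probability $1-\gamma$ and uniform on $[0,1)$ with probability $\gamma$; the function names are hypothetical.

```python
import math

def quantized_entropy_mixture(m, gamma):
    """Entropy (in nats) of floor(m * X) for X ~ (1 - gamma) * delta_0 + gamma * Uniform[0, 1).

    floor(m * X) equals 0 with probability (1 - gamma) + gamma / m and equals each of
    1, ..., m - 1 with probability gamma / m.
    """
    p0 = (1.0 - gamma) + gamma / m
    p_rest = gamma / m
    h = -p0 * math.log(p0)
    if p_rest > 0.0 and m > 1:
        h -= (m - 1) * p_rest * math.log(p_rest)
    return h

def dimension_estimate(m, gamma):
    """Finite-m version of the ratio in the definition of information dimension."""
    return quantized_entropy_mixture(m, gamma) / math.log(m)

gamma = 0.3
for exponent in (4, 10, 20, 40):
    m = 2 ** exponent
    print(exponent, dimension_estimate(m, gamma))
```

For this input the ratio equals roughly $\gamma + h(\gamma)/\log m$, with $h$ the binary entropy function, so it decreases toward $\gamma$ only logarithmically in $m$.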

The information dimension of $X$ is finite if and only if the mild condition

$$H(\lfloor X \rfloor) < \infty \quad (5)$$

is satisfied [12]. A sufficient condition for $d(X) < \infty$ is $\mathbb{E}[\log(1 + |X|)] < \infty$, much milder than
finite mean or finite variance.

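Equivalent definitions of information dimension include:⁴

• For an integer $M \ge 2$, write the $M$-ary expansion of $X$ as

$$X = \lfloor X \rfloor + \sum_{i \ge 1} (X)_i M^{-i}, \quad (6)$$

where the $i$-th $M$-ary digit $(X)_i = \lfloor M^i X \rfloor - M \lfloor M^{i-1} X \rfloor$ is a discrete random variable taking
values in $\{0, \ldots, M-1\}$. Then $d(X)$ is the normalized entropy rate of the digits $\{(X)_i\}$:

$$d(X) = \lim_{m \to \infty} \frac{H((X)_1, (X)_2, \ldots, (X)_m)}{m \log M}. \quad (7)$$

• Denote by $B(x,\delta)$ the open ball of radius $\delta$ centered at $x$. Then (see [31, Definition 4.2] and
[12, Appendix A])

$$d(X) = \lim_{\delta \downarrow 0} \frac{\mathbb{E}[\log P_X(B(X,\delta))]}{\log \delta}. \quad (8)$$

• The rate-distortion function of $X$ with mean-square error distortion is given by

$$R_X(D) = \inf_{\mathbb{E}|\hat{X} - X|^2 \le D} I(X; \hat{X}). \quad (9)$$

Then [32, Proposition 3.3]

$$d(X) = \lim_{D \downarrow 0} \frac{R_X(D)}{\frac{1}{2} \log \frac{1}{D}}. \quad (10)$$

• Let $N \sim \mathcal{N}(0,1)$ be independent of $X$. The mutual information $I(X; \sqrt{\mathsf{snr}} X + N)$ is finite if
and only if (5) holds [33]. Then [34]

$$d(X) = \lim_{\mathsf{snr} \to \infty} \frac{I(X; \sqrt{\mathsf{snr}} X + N)}{\frac{1}{2} \log \mathsf{snr}}. \quad (11)$$

⁴The lower and upper information dimensions are given by the liminf and limsup, respectively.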

The alternative definition in (7) implies that $d(X^n) \le n$ (as long as it is finite). For
discrete-continuous mixtures, the information dimension is given by the weight of the absolutely
continuous part.

Theorem 1 ([28]). Assume that $X$ has a discrete-continuous mixed distribution as in (3). If
$H(\lfloor X \rfloor) < \infty$, then

$$d(X) = \gamma. \quad (12)$$



In the presence of a singular component, the information dimension does not admit a simple
formula in general. One example where the information dimension can be explicitly determined is
the Cantor distribution, which can be defined via the following ternary expansion

$$X = \sum_{i \ge 1} (X)_i 3^{-i}, \quad (13)$$

where the $(X)_i$'s are i.i.d. and equiprobable on $\{0,2\}$. Then $P_X$ is absolutely singular with respect to
the Lebesgue measure and $d(X) = \log_3 2 \approx 0.63$, in view of (7).
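
The value $\log_3 2$ can be reproduced by brute force. The sketch below enumerates the first $j$ ternary digits of the Cantor distribution and computes the entropy of $\lfloor 3^j X \rfloor$, which is determined by those digits with probability one. The function name is hypothetical and the enumeration is only practical for small $j$.

```python
import itertools
import math
from collections import Counter

def cantor_quantized_entropy(j):
    """Entropy (in nats) of floor(3**j * X) for the Cantor distribution.

    With probability one, floor(3**j * X) is the integer whose ternary digits are the
    first j digits of X, each i.i.d. and equiprobable on {0, 2}.
    """
    counts = Counter()
    for digits in itertools.product((0, 2), repeat=j):
        value = 0
        for d in digits:
            value = 3 * value + d  # exact integer arithmetic in base 3
        counts[value] += 1
    total = 2 ** j
    return -sum((c / total) * math.log(c / total) for c in counts.values())

for j in (2, 6, 10):
    # Ratio H(floor(m X)) / log m along the subsequence m = 3**j.
    print(j, cantor_quantized_entropy(j) / math.log(3 ** j))
```

Along the subsequence $m = 3^j$ the ratio is exactly $\log 2 / \log 3$ for every $j$, because the $2^j$ digit strings map to distinct, equiprobable integers.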

2.2 MMSE dimension 

Introduced in [29], the MMSE dimension is an information measure that governs the high-SNR
asymptotics of the MMSE in Gaussian noise. Denote the MMSE of estimating $X$ based on $Y$ by

$$\mathrm{mmse}(X|Y) = \inf_f \mathbb{E}[(X - f(Y))^2] \quad (14)$$
$$= \mathbb{E}[(X - \mathbb{E}[X|Y])^2] = \mathbb{E}[\mathrm{var}(X|Y)], \quad (15)$$

where the infimum in (14) is over all Borel measurable $f$. When $Y$ is related to $X$ through an
additive Gaussian noise channel with gain $\sqrt{\mathsf{snr}}$, i.e., $Y = \sqrt{\mathsf{snr}} X + N$ with $N \sim \mathcal{N}(0,1)$ independent
of $X$, we denote

$$\mathrm{mmse}(X, \mathsf{snr}) = \mathrm{mmse}(X | \sqrt{\mathsf{snr}} X + N). \quad (16)$$

Definition 2. The MMSE dimension of $X$ is defined as

$$\mathscr{D}(X) = \lim_{\mathsf{snr} \to \infty} \mathsf{snr} \cdot \mathrm{mmse}(X, \mathsf{snr}). \quad (17)$$

If the limit in (17) does not exist, the liminf and limsup are called lower and upper MMSE
dimensions of $X$ respectively, denoted by $\underline{\mathscr{D}}(X)$ and $\overline{\mathscr{D}}(X)$.

It is shown in [29, Theorem 8] that the information dimensions are sandwiched between the
MMSE dimensions: if (5) is satisfied, then

$$0 \le \underline{\mathscr{D}}(X) \le \underline{d}(X) \le \overline{d}(X) \le \overline{\mathscr{D}}(X) \le 1. \quad (18)$$

For discrete-continuous mixtures, the MMSE dimension coincides with the information dimension:

Theorem 2 ([29, Theorem 15]). If $X$ has a discrete-continuous mixed distribution as in (3), then
$\mathscr{D}(X) = \gamma$.

It is possible that the MMSE dimension does not exist and the inequalities in (18) are strict. For
example, consider the Cantor distribution in (13). Then the product $\mathsf{snr} \cdot \mathrm{mmse}(X, \mathsf{snr})$ oscillates
periodically in $\log \mathsf{snr}$ between $\underline{\mathscr{D}}(X) \approx 0.62$ and $\overline{\mathscr{D}}(X) \approx 0.64$ [29, Theorem 16].
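
As a numerical illustration of Theorem 2, the following sketch evaluates $\mathsf{snr} \cdot \mathrm{mmse}(X, \mathsf{snr})$ for a Bernoulli-Gaussian input, i.e., (3) with $P_d = \delta_0$ and $P_c = \mathcal{N}(0,1)$. The choice of input, the function name, and the integration grid are assumptions made for this example; the grid is intended for $\mathsf{snr}$ of roughly 0.1 and above.

```python
import math

def snr_times_mmse(snr, gamma, half_width=40.0, step=2e-3):
    """Return snr * mmse(X, snr) for X ~ (1 - gamma) * delta_0 + gamma * N(0, 1).

    Assumes 0 < gamma < 1. With Y = sqrt(snr) * X + N and pi(y) = P(X != 0 | Y = y),
    var(X | y) = pi * s2 + pi * (1 - pi) * mu**2, where mu = sqrt(snr) * y / (1 + snr)
    and s2 = 1 / (1 + snr) are the posterior mean and variance given X != 0.
    The first term averages to gamma * s2 in closed form; the second term is
    integrated numerically with a Riemann sum over [-half_width, half_width].
    """
    v = 1.0 + snr  # variance of Y given X != 0
    s2 = 1.0 / v
    log_prior_ratio = math.log((1.0 - gamma) / gamma)
    num_steps = int(round(2.0 * half_width / step))
    integral = 0.0
    for i in range(num_steps + 1):
        y = -half_width + i * step
        # Log-likelihood ratio of "X = 0" against "X != 0" given Y = y.
        lr = log_prior_ratio + 0.5 * math.log(v) - 0.5 * y * y * (1.0 - s2)
        e = math.exp(lr)  # underflows harmlessly to 0.0 for large |y|
        one_minus_pi = e / (1.0 + e)
        # p(y) * pi(y) equals gamma times the N(0, v) density.
        density_nonzero = gamma * math.exp(-0.5 * y * y / v) / math.sqrt(2.0 * math.pi * v)
        mu = math.sqrt(snr) * y / v
        integral += density_nonzero * one_minus_pi * mu * mu * step
    return snr * (gamma * s2 + integral)

gamma = 0.3
for snr in (1.0, 1e2, 1e4, 1e6):
    print(snr, snr_times_mmse(snr, gamma))
```

The product always lies between the genie-aided value $\gamma\,\mathsf{snr}/(1+\mathsf{snr})$ (support known) and the linear-estimator value $\gamma\,\mathsf{snr}/(1+\gamma\,\mathsf{snr})$, and it should drift toward $\gamma$ as $\mathsf{snr}$ grows, in line with Theorem 2.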

2.3 Minkowski dimension

In fractal geometry, the Minkowski dimension (also known as the box-counting dimension) [35]
gauges the fractality of a subset in metric spaces, defined as the exponent with which the covering
number grows. The ($\epsilon$-)Minkowski dimension of a probability measure is defined as the lowest
Minkowski dimension among all sets with measure at least $1 - \epsilon$ [36].

Definition 3 (Minkowski dimension). Let $A$ be a nonempty bounded subset of $\mathbb{R}^n$. For $\delta > 0$,
denote by $N_A(\delta)$ the $\delta$-covering number of $A$, i.e., the smallest number of $\ell_2$-balls of radius $\delta$ needed
to cover $A$. Define the (upper) Minkowski dimension of $A$ as

$$\overline{\dim}_B A = \limsup_{\delta \to 0} \frac{\log N_A(\delta)}{\log \frac{1}{\delta}}. \quad (19)$$

Let $\mu$ be a probability measure on $(\mathbb{R}^n, \mathcal{B}_{\mathbb{R}^n})$. Define the ($\epsilon$-)Minkowski dimension of $\mu$ as

$$\overline{\dim}_B^{\epsilon}(\mu) = \inf\{\overline{\dim}_B(A) : \mu(A) \ge 1 - \epsilon\}. \quad (20)$$

Minkowski dimension is always nonnegative and at most the ambient dimension $n$, with
$\overline{\dim}_B A = 0$ for any finite set $A$ and $\overline{\dim}_B A = n$ for any bounded set $A$ with nonempty interior. An
intermediate example is the middle-third Cantor set $C$ in the unit interval: $\overline{\dim}_B C = \log_3 2$ [35,
Example 3.3].
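
The box-counting exponent of the Cantor set can be approximated directly. The sketch below is a simplification of Definition 3: instead of $\ell_2$-balls it counts triadic intervals of length $3^{-j}$ that contain a left endpoint of the stage-`depth` construction, which yields the same growth exponent. All names are hypothetical.

```python
import itertools
import math

def cantor_points(depth):
    """Left endpoints of the stage-`depth` intervals of the middle-third Cantor set.

    Each point is returned as an integer p representing the real number p / 3**depth.
    """
    points = []
    for digits in itertools.product((0, 2), repeat=depth):
        p = 0
        for d in digits:
            p = 3 * p + d
        points.append(p)
    return points

def box_count(points, depth, j):
    """Number of triadic intervals of length 3**(-j) containing at least one point."""
    scale = 3 ** (depth - j)
    return len({p // scale for p in points})

depth = 12
pts = cantor_points(depth)
for j in (2, 4, 8, 12):
    n_boxes = box_count(pts, depth, j)
    # log N(delta) / log(1 / delta) with delta = 3**(-j)
    print(j, n_boxes, math.log(n_boxes) / math.log(3 ** j))
```

Each scale gives $2^j$ occupied boxes, so the printed ratio equals $\log 2 / \log 3 \approx 0.63$ at every $j$, matching the value quoted from [35, Example 3.3].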

3 Noiseless compressed sensing 
3.1 Definitions 

Definition 4 (Lipschitz continuity). Let $U \subset \mathbb{R}^n$ and $f: U \to \mathbb{R}^k$. Define⁵

$$\mathrm{Lip}(f) \triangleq \sup_{x \ne y} \frac{\|f(x) - f(y)\|}{\|x - y\|}. \quad (21)$$

If $\mathrm{Lip}(f) \le L$ for some $L \in \mathbb{R}_+$, we say that $f$ is $L$-Lipschitz continuous, and $\mathrm{Lip}(f)$ is called the
Lipschitz constant of $f$.

Remark 1. $\mathrm{Lip}(\cdot)$ defines a pseudo-norm on the space of all functions.

The Shannon-theoretic fundamental limits of noiseless compressed sensing are defined as follows.

Definition 5. Let $X^n$ be a random vector consisting of independent copies of $X$. Define the
minimum $\epsilon$-achievable rate to be the minimum of $R > 0$ such that there exists a sequence of
encoders $f_n: \mathbb{R}^n \to \mathbb{R}^{\lfloor Rn \rfloor}$ and decoders $g_n: \mathbb{R}^{\lfloor Rn \rfloor} \to \mathbb{R}^n$, such that

$$P\{g_n(f_n(X^n)) \ne X^n\} \le \epsilon \quad (22)$$

for all sufficiently large $n$. The minimum $\epsilon$-achievable rate is denoted by $R^*(X,\epsilon)$, $R(X,\epsilon)$ and
$\hat{R}(X,\epsilon)$ depending on the class of allowable encoders and decoders as specified in Table 1.⁶

⁵Throughout the paper, $\|\cdot\|$ denotes the $\ell_2$ norm on the Euclidean space. It should be noted that the proof in the
present paper relies crucially on the inner product structure endowed by the $\ell_2$ norm. See Remark 5.

⁶It was shown in [12] that in the definition of $R^*$ and $R$, the continuity constraint can be replaced by Borel
measurability without changing the minimum rate.

Table 1: Regularity conditions of encoder/decoders and corresponding minimum $\epsilon$-achievable rates.

Encoder       Decoder       Minimum $\epsilon$-achievable rate
Linear        Continuous    $R^*(X,\epsilon)$
Continuous    Lipschitz     $R(X,\epsilon)$
Linear        Lipschitz     $\hat{R}(X,\epsilon)$



Remark 2. In Definition 5, $R$ and $\hat{R}$ are defined under the Lipschitz continuity assumption of the
decoder, which does not preclude the case where the Lipschitz constants blow up as the dimension
grows. For practical applications, decoders with bounded Lipschitz constants are desirable, which
amounts to constructing a sequence of decoders with Lipschitz constant only depending on the
rate and the input statistics. As we will show later, this is indeed possible for discrete-continuous
mixtures.

3.2 Results 

The following general result holds for any input process [12]: 

Theorem 3. For any $X$ and any $0 < \epsilon < 1$,

$$R^*(X,\epsilon) \le R(X,\epsilon) \le \hat{R}(X,\epsilon). \quad (23)$$

Moreover, (23) holds for arbitrary input processes that are not necessarily i.i.d.

The second inequality in (23) follows from the definitions, since

$$\hat{R}(X,\epsilon) \ge \max\{R^*(X,\epsilon), R(X,\epsilon)\}.$$

Far less intuitive is the first inequality, proved in [12, Section V], which states that robust
reconstruction is always harder to achieve than linear compression.

The following result is a finite-dimensional version of the general achievability result of linear 
encoding in [12, Theorem 18], which states that sets of low Minkowski dimension can be linearly em- 
bedded into low-dimensional Euclidean space probabilistically. This is a probabilistic generalization 
of the embeddability result in [37]. 

Theorem 4. Let $X^n$ be a random vector with $\overline{\dim}_B^{\epsilon}(P_{X^n}) \le k$. Let $m > k$. Then for Lebesgue
almost every $A \in \mathbb{R}^{m \times n}$, there exists a $(1 - \frac{k}{m})$-Hölder continuous function $g: \mathbb{R}^m \to \mathbb{R}^n$, i.e.,
$\|g(x) - g(y)\| \le L \|x - y\|^{1 - \frac{k}{m}}$ for some $L > 0$ and all $x, y$, such that $P\{g(AX^n) \ne X^n\} \le \epsilon$.

Remark 3. In Theorem 4, the decoder can be chosen as follows: by definition of $\overline{\dim}_B^{\epsilon}$, there exists
$U \subset \mathbb{R}^n$, such that $\overline{\dim}_B(U) \le k$. Then if $x^n$ is the unique solution to the linear equation $Ax^n = y^m$
in $U$, the decoder outputs $g(y^m) = x^n$; otherwise $g(y^m) = 0$.

Generalizing [12, Theorem 9], a non-asymptotic converse for Lipschitz decoding is the following: 

Theorem 5. For any random vector $X^n$, if there exists a Borel measurable $f: \mathbb{R}^n \to \mathbb{R}^k$ and a
Lipschitz continuous $g: \mathbb{R}^k \to \mathbb{R}^n$ such that $P\{g(f(X^n)) \ne X^n\} \le \epsilon$, then

$$k \ge \overline{\dim}_B^{\epsilon}(P_{X^n}) \ge \overline{d}(X^n) - \epsilon n. \quad (24)$$

Proof. Section 6.2. □

Remark 4. An immediate consequence of (24) is that for general input processes, we have

$$R(X,\epsilon) \ge \limsup_{n \to \infty} \frac{\overline{d}(X^n)}{n} - \epsilon, \quad (25)$$

which, for i.i.d. inputs, becomes

$$R(X,\epsilon) \ge \overline{d}(X) - \epsilon. \quad (26)$$

In fact, combining the left inequality in (24) and the following concentration-of-measure result [12,
Theorem 14]: for any $0 < \epsilon < 1$,

$$\liminf_{n \to \infty} \frac{\overline{\dim}_B^{\epsilon}(P_{X^n})}{n} \ge \overline{d}(X), \quad (27)$$

(26) can be superseded by the following strong converse:

$$R(X,\epsilon) \ge \overline{d}(X). \quad (28)$$

General achievability results for $R(X,\epsilon)$ rely on rectifiability results from geometric measure theory
[38]. See [12, Section VII].

For discrete-continuous mixtures, we show that linear encoders and Lipschitz decoders can be 
realized simultaneously with bounded Lipschitz constants. 

Theorem 6 (Linear encoding: discrete-continuous mixture). Let $P_X$ be a discrete-continuous mixed
distribution of the form (3), with the weight of the continuous part equal to $\gamma$. Then

$$R^*(X,\epsilon) = d(X) = \gamma \quad (29)$$

for all $0 < \epsilon < 1$. Moreover, if the discrete part $P_d$ has finite entropy, then for any rate $R > d(X)$,
the decompressor can be chosen to be Lipschitz continuous with respect to the $\ell_2$-norm with a
Lipschitz constant independent of $n$:

Consequently, 

$$\hat{R}(X,\epsilon) = R^*(X,\epsilon) = d(X) = \gamma. \quad (31)$$

Proof. Section 6.2. □

Combining Theorem 6 and [12, Theorem 10] yields the following tight result: for any i.i.d.
input with a common distribution of the discrete-continuous mixture form in (3), whose discrete
component has finite entropy, we have

$$R^*(X,\epsilon) = R(X,\epsilon) = \hat{R}(X,\epsilon) = \gamma \quad (32)$$

for all $0 < \epsilon < 1$. In the special case of sparse signals ($P_d = \delta_0$) with $s = \gamma n$ non-zeros, this
implies that roughly $s$ linear measurements are sufficient to recover the unknown vector with high
probability. This agrees with the well-known result that $s + 1$ measurements are both necessary
and sufficient to reconstruct an $s$-sparse vector probabilistically (see, e.g., [39]).

Remark 5. In the achievability proof of Theorem 6, our construction of a sequence of Lipschitz
decoders with bounded Lipschitz constants independent of the dimension $n$ only works for recovery
performance measured in the $\ell_2$ norm. The reason is two-fold: First, the Lipschitz constant of a
linear mapping with respect to the $\ell_2$ norm is given by its maximal singular value, whose behavior
for random measurement matrices is well studied. Second, Kirszbraun's theorem states that any
Lipschitz mapping from a subset of a Hilbert space to a Hilbert space can be extended to the whole
space with the same Lipschitz constant [40, Theorem 1.31, p. 21]. This result fails for general
Banach spaces, in particular, for $\mathbb{R}^n$ equipped with any $\ell_p$-norm ($p \ne 2$) [40, p. 20]. Of course, by
the equivalence of norms on finite-dimensional spaces, it is always possible to extend to a Lipschitz
function with a larger Lipschitz constant; however, such soft analysis does not control the size of
the Lipschitz constant, which may possibly blow up as the dimension increases. Nevertheless, (28)
shows that even if we allow a sequence of decompressors with Lipschitz constants that diverges as
$n \to \infty$, the compression rate is still lower bounded by $\overline{d}(X)$.

Remark 6 (Behavior of the Lipschitz constant). The Lipschitz constant of the decoder is a proxy
to gauge the decoding robustness. It is interesting to investigate the smallest attainable
Lipschitz constant for a given rate $R > \gamma$. Note that the constant in (30) depends exponentially
on $\frac{1}{R - \gamma}$, which implies that the decoding becomes increasingly less robust as the rate approaches
the fundamental limit. For sparse signals ($P_d = \delta_0$, hence $H(P_d) = 0$), (30) reduces to

'eR / R\R-i / 1 \ «-T 



It is unclear whether it is possible to achieve a Lipschitz constant that diverges polynomially as
$R \to \gamma$.

Remark 7. Although too computationally intensive and numerically unstable (in fact discontinuous
in general), in the conventional compressed sensing setup, the optimal decoder is an
$\ell_0$-minimizer that seeks the sparsest solution compatible with the linear measurements. In our
Bayesian setting, such a decoder does not necessarily minimize the probability of selecting the
wrong signal. However, the $\ell_0$-minimization decoder does achieve the asymptotic fundamental
limit $R^*(X,\epsilon)$ for any sparse $P_X = (1-\gamma)\delta_0 + \gamma P_c$ since it is, in fact, even better than the
asymptotically optimum decoder described in Remark 3. The optimality of the $\ell_0$-minimization decoder
for sparse signals has also been observed in [17, Section IV-A1] based on replica heuristics.

Remark 8. Converse results for any linear encoder and decoder pair have been proposed before
in other compressed sensing setups. For example, the result in [41, Theorem 3.1] assumes noiseless
measurement with arbitrary sensing matrices and recovery algorithms, dealing with best sparse
approximation under $\ell_1/\ell_1$-stability guarantee. The following non-asymptotic lower bound on the
number of measurements is shown: if there exist a sensing matrix $A \in \mathbb{R}^{k \times n}$, a decoder $g: \mathbb{R}^k \to \mathbb{R}^n$
and a constant $C > 0$, such that⁷

$$\|x - g(Ax)\|_1 \le \min_{z \in B_0(s)} C \|x - z\|_1 \quad (34)$$

for any $x \in \mathbb{R}^n$, then $k \ge \frac{s \log \frac{n}{s}}{\log(4 + 2C)}$. However, this result does not directly apply to our setup because
we are dealing with $\ell_2/\ell_2$-stability guarantee with respect to the measurement noise, instead of the
sparse approximation error of the input vector.

⁷$B_0(s) = \{x \in \mathbb{R}^n : \|x\|_0 \le s\}$ denotes the collection of all $s$-sparse $n$-dimensional vectors.

4 Noisy compressed sensing
4.1 Setup

The basic setup of noisy compressed sensing is a joint source-channel coding problem as shown
in Fig. 2, where we assume that

• The source $X^n$ consists of i.i.d. copies of a real-valued random variable $X$ with unit variance.

• The channel is stationary memoryless with i.i.d. additive Gaussian noise $\sigma N^k$ where
$N^k \sim \mathcal{N}(0, I_k)$.

• Unit average power constraint on the encoded signal:

$$\frac{1}{k} \mathbb{E}[\|f(X^n)\|_2^2] \le 1. \quad (35)$$

• The reconstruction error is gauged by the per-symbol MSE distortion:

$$d(x^n, \hat{x}^n) = \frac{1}{n} \|x^n - \hat{x}^n\|^2. \quad (36)$$

Figure 2: Noisy compressed sensing setup.


In this setup, the fundamental question is: For a given noise variance and measurement rate,
what is the lowest reconstruction error? For a given encoder $f$, the corresponding optimal decoder
$g$ is the MMSE estimator of the input $X^n$ given the channel output $Y^k = f(X^n) + \sigma N^k$. Therefore
the optimal distortion achieved by encoder $f$ is

$$\inf_g \mathbb{E}[\|X^n - g(Y^k)\|^2] = \mathrm{mmse}(X^n | f(X^n) + \sigma N^k). \quad (37)$$



In the case of noiseless compressed sensing, the interesting regime of measurement rates is 
between zero and one. When the measurements are noisy, in principle it makes sense to consider 
measurement rates greater than one in order to combat the noise. Nevertheless, the optimal phase 
transition for noise sensitivity is always less than one, because with k = n and an invertible 
measurement matrix, the linear MMSE estimator achieves bounded noise sensitivity for any noise 
variance. 
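
The last claim can be made concrete with the simplest choice $k = n$ and the identity sensing matrix, which meets the power constraint (35) for a zero-mean, unit-variance source. The sketch below is an illustration under those assumptions; the function names are hypothetical and the simulated source is taken to be Gaussian only for convenience.

```python
import random

def lmmse_noise_sensitivity(sigma2):
    """Noise sensitivity of the linear MMSE estimator for Y = X + sigma * N.

    For zero-mean, unit-variance X the linear estimate Y / (1 + sigma2) has
    per-symbol MSE sigma2 / (1 + sigma2), so the ratio MSE / sigma2 is 1 / (1 + sigma2).
    """
    mse = sigma2 / (1.0 + sigma2)
    return mse / sigma2

def simulated_noise_sensitivity(sigma2, num_samples, rng):
    """Monte Carlo estimate of the same quantity (Gaussian X assumed for illustration)."""
    sigma = sigma2 ** 0.5
    sq_err = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        y = x + sigma * rng.gauss(0.0, 1.0)
        x_hat = y / (1.0 + sigma2)  # linear MMSE estimate
        sq_err += (x - x_hat) ** 2
    return (sq_err / num_samples) / sigma2

rng = random.Random(0)
for sigma2 in (1e-4, 1e-2, 1.0, 100.0):
    print(sigma2, lmmse_noise_sensitivity(sigma2),
          simulated_noise_sensitivity(sigma2, 50_000, rng))
```

The sensitivity never exceeds one, whatever the noise variance, which is why rates above one are uninteresting for the phase transition.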



4.2 Distortion-rate tradeoff 

For a fixed noise variance $\sigma^2$, we define three distortion-rate functions that correspond to optimal
encoding, optimal linear encoding and random linear encoding, respectively. In the remainder of
this section, we fix $k = \lfloor Rn \rfloor$.

4.2.1 Optimal encoder

Definition 6. The minimal distortion achieved by the optimal encoding scheme is given by:

$$D^*(X,R,\sigma^2) \triangleq \limsup_{n \to \infty} \frac{1}{n} \inf_f \left\{ \mathrm{mmse}(X^n | f(X^n) + \sigma N^k) : \mathbb{E}[\|f(X^n)\|_2^2] \le k \right\}. \quad (38)$$

For stationary ergodic sources, the asymptotic optimization problem in (38) can be solved by
applying Shannon's joint source-channel coding separation theorem [42, Section XI], which states
that the lowest rate, $R$, that achieves distortion $D$ is given by

$$R = \frac{R_X(D)}{C(\sigma^2)}, \quad (39)$$

where $R_X(\cdot)$ is the rate-distortion function of $X$ in (9) and $C(\sigma^2) = \frac{1}{2} \log(1 + \sigma^{-2})$ is the AWGN
channel capacity. By the monotonicity of the rate-distortion function, we have

$$D^*(X,R,\sigma^2) = R_X^{-1}\left( \frac{R}{2} \log(1 + \sigma^{-2}) \right). \quad (40)$$

In general, optimal joint source-channel encoders are nonlinear [43]. In fact, Shannon's separation
theorem states that the composition of an optimal lossy source encoder and an optimal channel
encoder is asymptotically optimal when blocklength $n \to \infty$. Such a construction results in an
encoder that is finite-valued, hence nonlinear. For fixed $n$ and $k$, linear encoders are in general
suboptimal.
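
As a worked instance of (40), consider a standard Gaussian source, for which the rate-distortion function is the well-known $R_X(D) = \frac{1}{2} \log \frac{1}{D}$ for $0 < D \le 1$. Substituting into (40) gives $D^* = (1 + \sigma^{-2})^{-R}$. The sketch below evaluates this expression and the resulting noise sensitivity $D^*/\sigma^2$; the choice of source and the function names are assumptions of this example.

```python
def d_star_gaussian(rate, sigma2):
    """D*(X, R, sigma^2) from (40) for a standard Gaussian source.

    Uses R_X(D) = 0.5 * log(1 / D), so that D* = (1 + 1 / sigma2) ** (-rate).
    """
    return (1.0 + 1.0 / sigma2) ** (-rate)

def noise_sensitivity(rate, sigma2):
    """Ratio between the optimal distortion and the noise variance."""
    return d_star_gaussian(rate, sigma2) / sigma2

for rate in (0.5, 1.0, 2.0):
    # Watch the behavior of the ratio as the noise variance shrinks.
    print(rate, [noise_sensitivity(rate, s2) for s2 in (1e-2, 1e-4, 1e-8)])
```

For small $\sigma^2$ the ratio behaves like $\sigma^{2(R-1)}$: it blows up when $R < 1$ and stays bounded when $R \ge 1$. This matches a threshold at rate one, the information dimension of an absolutely continuous input by Theorem 1.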



4.2.2 Optimal linear encoder 

To analyze the fundamental limit of conventional noisy compressed sensing, we restrict the
encoder $f$ to be a linear mapping, denoted by a matrix $H \in \mathbb{R}^{k \times n}$. Since $X^n$ are i.i.d. with zero
mean and unit variance, the input power constraint (35) simplifies to

$$\mathbb{E}[\|HX^n\|_2^2] = \mathbb{E}[X^{nT} H^T H X^n] = \mathrm{Tr}(H^T H) = \|H\|_F^2 \le k, \quad (41)$$

where $\|\cdot\|_F$ denotes the Frobenius norm.
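
The identity between the average output energy and the squared Frobenius norm in (41) can be checked by simulation. The sketch below uses entries uniform on $[-\sqrt{3}, \sqrt{3}]$, which have zero mean and unit variance; this input distribution, the matrix sizes, and the function names are all choices made for the example.

```python
import random

def frobenius_norm_sq(h):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(v * v for row in h for v in row)

def average_output_energy(h, num_samples, rng):
    """Monte Carlo estimate of E[||H X||^2] for X with i.i.d. zero-mean, unit-variance entries.

    Entries are drawn uniformly from [-sqrt(3), sqrt(3)], whose variance is 1.
    """
    n = len(h[0])
    a = 3.0 ** 0.5
    total = 0.0
    for _ in range(num_samples):
        x = [rng.uniform(-a, a) for _ in range(n)]
        total += sum(sum(hij * xj for hij, xj in zip(row, x)) ** 2 for row in h)
    return total / num_samples

rng = random.Random(0)
k, n = 3, 6
h = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
print(frobenius_norm_sq(h), average_output_energy(h, 50_000, rng))
```

The two printed numbers should agree up to Monte Carlo error, since only the mean and variance of the entries of $X^n$ enter the calculation.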

Definition 7. Define the optimal distortion achievable by linear encoders as:

$$D_L^*(X,R,\sigma^2) \triangleq \limsup_{n \to \infty} \frac{1}{n} \inf_H \left\{ \mathrm{mmse}(X^n | HX^n + \sigma N^k) : \|H\|_F^2 \le k \right\}. \quad (42)$$



4.2.3 Random linear encoder 

We consider the ensemble performance of random linear encoders and relax the power constraint
in (41) to hold on average:

$$\mathbb{E}[\|A\|_F^2] \le k. \quad (43)$$

In particular, we focus on the following ensemble of random sensing matrices, for which (43) holds
with equality:

Definition 8. Let $A_n$ be a $k \times n$ random matrix with i.i.d. entries of zero mean and variance $\frac{1}{n}$.
The minimal expected distortion achieved by this ensemble of linear encoders is given by:

$$D_L(X,R,\sigma^2) = \limsup_{n \to \infty} \frac{1}{n} \mathrm{mmse}(X^n | (A_n X^n + \sigma N^k, A_n)) \quad (44)$$
$$= \limsup_{n \to \infty} \mathrm{mmse}(X_1 | (A_n X^n + \sigma N^k, A_n)), \quad (45)$$

where (45) follows from symmetry and $\mathrm{mmse}(\cdot|\cdot)$ is defined in (15).⁸

General formulae for $D_L(X,R,\sigma^2)$ and $D_L^*(X,R,\sigma^2)$ are yet unknown. One example where
they can be explicitly computed is given in Section 4.4, the Gaussian source.

4.2.4 Properties 

Theorem 7. 1. For fixed $\sigma^2$, $D^*(X,R,\sigma^2)$ and $D_L^*(X,R,\sigma^2)$ are both decreasing, convex and
continuous in $R$ on $(0,\infty)$.

2. For fixed $R$, $D^*(X,R,\sigma^2)$ and $D_L^*(X,R,\sigma^2)$ are both decreasing, convex and continuous in $\frac{1}{\sigma^2}$
on $(0,\infty)$.

3.

$$D^*(X,R,\sigma^2) \le D_L^*(X,R,\sigma^2) \le D_L(X,R,\sigma^2) \le 1. \quad (46)$$

Proof. 1. Fix $\sigma^2$. Monotonicity with respect to the measurement rate $R$ is straightforward from
the definition of $D^*$ and $D_L^*$. Convexity follows from time-sharing between two encoding
schemes. Finally, convexity on the open interval $(0,\infty)$ implies continuity.

2. Fix $R$. For any $n$ and any encoder $f: \mathbb{R}^n \to \mathbb{R}^k$, $\sigma^2 \mapsto \mathrm{mmse}(X^n | f(X^n) + \sigma N^k)$ is increasing.
This is a consequence of the infinite divisibility of the Gaussian distribution as well as the
data processing lemma of MMSE [44]. Consequently, $\sigma^2 \mapsto D^*(X,R,\sigma^2)$ is also increasing.
Since $D^*$ can be equivalently defined as

$$D^*(X,R,\sigma^2) = \limsup_{n \to \infty} \frac{1}{n} \inf_f \left\{ \mathrm{mmse}(X^n | f(X^n) + N^k) : \mathbb{E}[\|f(X^n)\|_2^2] \le \frac{k}{\sigma^2} \right\}, \quad (47)$$

convexity in $\frac{1}{\sigma^2}$ follows from time-sharing. The results on $D_L^*$ follow analogously.

3. The leftmost inequality in (46) follows directly from the definition, while the rightmost
inequality follows because we can always discard the measurements and use the mean as an
estimate. Although the best sensing matrix will beat the average behavior of any ensemble,
the middle inequality in (46) is not quite trivial because the power constraint in (35) is
not imposed on each matrix in the ensemble. The proof of this inequality can be found in
Appendix A. □

Remark 9. Alternatively, the convexity properties of $D^*$ can be derived from (40). Since $R_X(\cdot)$ is decreasing and convex, $R_X^{-1}(\cdot)$ is decreasing and convex, which, composed with the concave mapping $\sigma^{-2} \mapsto \frac{R}{2}\log(1+\sigma^{-2})$, gives a convex function $\sigma^{-2} \mapsto D^*(X,R,\sigma^2)$ [45, p. 84]. The convexity of $R \mapsto D^*(X,R,\sigma^2)$ can be similarly proved.

*The MMSE on the right-hand side of (44) and (45) can be computed by first fixing the sensing matrix $A_n$ and then averaging with respect to its distribution.

Remark 10. Note that the time-sharing proofs of Theorem 7.1 and 7.2 do not work for $D_L$, because time-sharing between two random linear encoders results in a block-diagonal matrix with diagonal submatrices each filled with i.i.d. entries. This ensemble is outside the scope of random matrices with i.i.d. entries considered in Definition 8. Therefore, proving the convexity of $R \mapsto D_L(X,R,\sigma^2)$ amounts to showing that replacing all zeroes in the block-diagonal matrix with independent entries of the same distribution always helps with the estimation. This is certainly not true for individual matrices.

4.3 Phase transition of noise sensitivity 

One of the main objectives of noisy compressed sensing is to achieve robust reconstruction, obtaining a reconstruction error that is proportional to the noise variance. To quantify robustness, we analyze noise sensitivity, namely the ratio between the mean-square error and the noise variance, at a given $R$ and $\sigma^2$. As a succinct characterization of robustness, we focus particular attention on the worst-case noise sensitivity:

Definition 9. The worst-case noise sensitivity of optimal encoding is defined as

$$\zeta^*(X,R) = \sup_{\sigma^2 > 0} \frac{D^*(X,R,\sigma^2)}{\sigma^2}. \tag{48}$$

For linear encoding, $\zeta_L^*(X,R)$ and $\zeta_L(X,R)$ are analogously defined with $D^*$ in (48) replaced by $D_L^*$ and $D_L$ respectively.

Remark 11. In the analysis of LASSO and the AMP algorithms [13], the noise sensitivity is defined in a minimax fashion where a further supremum is taken over all input distributions that have an atom at zero of mass at least $1-\epsilon$. In contrast, the sensitivity in Definition 9 is a Bayesian quantity where we fix the input distribution. A similar notion of sensitivity has been defined in [16, Equation (49)].

The phase transition threshold of the noise sensitivity is defined as the minimal measurement rate $R$ such that the noise sensitivity is bounded for all noise variance [13, 1]:

Definition 10. Define

$$\mathcal{R}^*(X) = \inf\{R > 0 : \zeta^*(X,R) < \infty\}. \tag{49}$$

For linear encoding, $\mathcal{R}_L^*(X)$ and $\mathcal{R}_L(X)$ are analogously defined with $\zeta^*$ in (49) replaced by $\zeta_L^*$ and $\zeta_L$.

By (46), the phase transition thresholds in Definition 10 are ordered naturally as

$$0 \le \mathcal{R}^*(X) \le \mathcal{R}_L^*(X) \le \mathcal{R}_L(X) \le 1, \tag{50}$$

where the rightmost inequality is shown below (after Theorem 8).

Remark 12. In view of the convexity properties in Theorem 7.2, the three worst-case sensitivities 
in Definition 9 are all (extended real-valued) convex functions of R. 

Remark 13. Alternatively, we can consider the asymptotic noise sensitivity by replacing the supremum in (48) with the limit as $\sigma^2 \to 0$, denoted by $\xi^*$, $\xi_L^*$ and $\xi_L$ respectively. Asymptotic noise sensitivity characterizes the convergence rate of the reconstruction error as the noise variance vanishes. Since $D^*(X,R,\sigma^2)$ is always bounded above by $\mathrm{var}X = 1$, we have

$$\zeta^*(X,R) < \infty \Leftrightarrow \xi^*(X,R) < \infty. \tag{51}$$

Therefore $\mathcal{R}^*(X)$ can be equivalently defined as the infimum of all rates $R > 0$, such that

$$D^*(X,R,\sigma^2) = O(\sigma^2), \quad \sigma^2 \to 0. \tag{52}$$

This equivalence also applies to $D_L^*$ and $D_L$. It should be noted that although finite worst-case noise sensitivity is equivalent to finite asymptotic noise sensitivity, the supremum in (51) need not be achieved as $\sigma^2 \to 0$. An example is given by the Gaussian input analyzed in Section 4.4.

4.4 Least-favorable input: Gaussian distribution 

In this section we compute the distortion-rate tradeoffs for the Gaussian input distribution. 
Although Gaussian input distribution is not directly relevant for compressed sensing due to its lack 
of sparsity, it is still interesting to investigate the distortion-rate tradeoff in the Gaussian case for 
the following reasons: 

1. As the least-favorable input distribution, Gaussian distribution simultaneously maximizes all 
three distortion-rate functions subject to the variance constraint and provides upper bounds 
for non-Gaussian inputs. 

2. Connections are made to classical joint-source-channel-coding problems in information theory 
about transmitting Gaussian sources over Gaussian channels and (sub)optimality of linear 
coding (e.g., [46, 47, 48]). 

3. It serves as a concrete illustration of the phenomenon of coincidence of all thresholds defined in Definitions 6-8, which is fully generalized in Section 4.5 to the mixture model.

Theorem 8. Let $X_G \sim \mathcal{N}(0,1)$. Then for any $R$, $\sigma^2$ and $X$ of unit variance,

$$D^*(X,R,\sigma^2) \le D^*(X_G,R,\sigma^2) = \frac{1}{(1+\sigma^{-2})^R}, \tag{53}$$

$$D_L^*(X,R,\sigma^2) \le D_L^*(X_G,R,\sigma^2) = 1 - \frac{R}{\max\{1,R\}+\sigma^2}, \tag{54}$$

$$D_L(X,R,\sigma^2) \le D_L(X_G,R,\sigma^2) = \frac{1}{2}\left(1-R-\sigma^2+\sqrt{(1-R)^2+2(1+R)\sigma^2+\sigma^4}\right). \tag{55}$$

Proof. Since the Gaussian distribution maximizes the rate-distortion function pointwise under the variance constraint [49, Theorem 4.3.3], the inequality in (53) follows from (40). For linear encoding, linear estimators are optimal for Gaussian inputs since the channel output and the input are jointly Gaussian, but suboptimal for non-Gaussian inputs. Moreover, the linear MMSE depends only on the input variance. Therefore the inequalities in (54) and (55) follow. The distortion-rate functions of $X_G$ are computed in Appendix B. □

The Gaussian distortion-rate tradeoffs in (53) - (55) are plotted in Figs. 3 and 4. We see that linear encoders are optimal for lossy encoding of Gaussian sources over Gaussian channels if and only if $R = 1$, i.e.,

$$D^*(X_G,1,\sigma^2) = D_L^*(X_G,1,\sigma^2), \tag{56}$$

which is a well-known fact [46, 47]. As a result of (55), the rightmost inequality in (50) follows.
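As a numerical sanity check on Theorem 8, the sketch below evaluates the three Gaussian distortion-rate functions. It assumes the closed forms read $D^* = (1+\sigma^{-2})^{-R}$ for (53), $D_L^* = 1 - R/(\max\{1,R\}+\sigma^2)$ for (54), and (55) as printed; the function names are illustrative and not part of the paper.

```python
import math

def d_opt(R, s2):
    """Optimal (nonlinear) encoding, assumed form of (53)."""
    return (1.0 + 1.0 / s2) ** (-R)

def d_lin_opt(R, s2):
    """Optimal linear encoding, assumed form of (54)."""
    return 1.0 - R / (max(1.0, R) + s2)

def d_lin_rand(R, s2):
    """Random i.i.d. linear encoding, equation (55)."""
    disc = (1.0 - R) ** 2 + 2.0 * (1.0 + R) * s2 + s2 ** 2
    return 0.5 * (1.0 - R - s2 + math.sqrt(disc))

if __name__ == "__main__":
    # Print the three curves at sigma^2 = 1 (the setting of Fig. 4).
    for R in (0.25, 0.5, 1.0, 2.0, 5.0):
        print(R, d_opt(R, 1.0), d_lin_opt(R, 1.0), d_lin_rand(R, 1.0))
```

Under these assumptions the printed values should respect the ordering in (46), with the first two columns coinciding at $R = 1$ as stated in (56).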

Next, using straightforward limits, we analyze the high-SNR asymptotics of (53) - (55). The smallest among the three, $D^*(X_G,R,\sigma^2)$, vanishes polynomially in $\sigma^2$ according to

$$D^*(X_G,R,\sigma^2) = \sigma^{2R} + O(\sigma^{2R+2}), \quad \sigma^2 \to 0 \tag{57}$$

regardless of how small $R > 0$ is. For linear encoding, we have

$$D_L^*(X_G,R,\sigma^2) = \begin{cases} 1-R+R\sigma^2+O(\sigma^4) & 0<R<1, \\ \sigma^2+O(\sigma^4) & R=1, \\ \frac{\sigma^2}{R}+O(\sigma^4) & R>1, \end{cases} \tag{58}$$

$$D_L(X_G,R,\sigma^2) = \begin{cases} 1-R+\frac{R}{1-R}\sigma^2+O(\sigma^4) & 0<R<1, \\ \sigma-\frac{\sigma^2}{2}+O(\sigma^3) & R=1, \\ \frac{\sigma^2}{R-1}+O(\sigma^4) & R>1. \end{cases} \tag{59}$$

Figure 4: $D^*(X_G,R,\sigma^2)$, $D_L^*(X_G,R,\sigma^2)$, $D_L(X_G,R,\sigma^2)$ against $R$ when $\sigma^2 = 1$.

The weak-noise behavior of $D_L^*$ and $D_L$ is compared in different regimes of measurement rates:

• $0 < R < 1$: both $D_L^*$ and $D_L$ converge to $1-R > 0$. This is an intuitive result, because even in the absence of noise, the orthogonal projection of the input vector onto the nullspace of the sensing matrix cannot be recovered, which contributes a total mean-square error of $(1-R)n$. Moreover, $D_L$ has strictly worse second-order asymptotics than $D_L^*$, especially when $R$ is close to 1.

• $R = 1$: $D_L = \sigma(1+o(1))$ is much worse than $D_L^* = \sigma^2(1+o(1))$, which is achieved by choosing the encoding matrix to be identity. In fact, with nonnegligible probability, the optimal estimator that attains (55) blows up the noise power when inverting the random matrix;

• $R > 1$: both $D_L^*$ and $D_L$ behave according to $\Theta(\sigma^2)$, but the scaling constant of $D_L$ is strictly worse, especially when $R$ is close to 1.



The foregoing high-SNR analysis shows that the average performance of random sensing matrices with i.i.d. entries is much worse than that of optimal sensing matrices, except if $R \ll 1$ or $R \gg 1$. Although this conclusion stems from the high-SNR asymptotics, we test it with several numerical results. Fig. 3 ($R = 0.3$ and $5$) and Fig. 4 ($\sigma^2 = 1$) illustrate that the superiority of optimal sensing matrices carries over to the regime of non-vanishing $\sigma^2$. However, as we will see, randomly selected matrices are as good as the optimal matrices (and in fact, optimal nonlinear encoders) as far as the phase transition threshold of the worst-case noise sensitivity is concerned.
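The leading terms of the weak-noise behavior of random linear encoding can be recovered directly from the closed form (55). The following is a brief derivation sketch, writing $u = \sigma^2$ and assuming (55) as stated; it is meant as a check on the expansions rather than a replacement for Appendix B.

```latex
\begin{align*}
D_L(X_G,R,u) &= \tfrac12\Big(1-R-u+\sqrt{(1-R)^2+2(1+R)u+u^2}\Big).
\intertext{For $R\neq 1$, $\sqrt{(1-R)^2+2(1+R)u+u^2} = |1-R| + \frac{1+R}{|1-R|}\,u + O(u^2)$, hence}
R<1:\quad D_L &= (1-R) + \tfrac12\Big(\tfrac{1+R}{1-R}-1\Big)u + O(u^2)
              = 1-R+\tfrac{R}{1-R}\,u+O(u^2),\\
R>1:\quad D_L &= \tfrac12\Big(\tfrac{1+R}{R-1}-1\Big)u + O(u^2)
              = \tfrac{u}{R-1}+O(u^2).
\intertext{For $R=1$ the square root equals $\sqrt{4u+u^2} = 2\sqrt{u}\,\sqrt{1+u/4}$, so}
D_L &= \sqrt{u}-\tfrac{u}{2}+O(u^{3/2}) = \sigma-\tfrac{\sigma^2}{2}+O(\sigma^3).
\end{align*}
```

Both first-order coefficients, $R/(1-R)$ and $1/(R-1)$, diverge as $R \to 1$, which is consistent with random matrices faring worst near the critical rate.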

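From (58) and (59), we observe that both $D_L^*$ and $D_L$ exhibit a sharp phase transition near the critical rate $R = 1$:

$$\lim_{\sigma^2\to 0} D_L^*(X_G,R,\sigma^2) = \lim_{\sigma^2\to 0} D_L(X_G,R,\sigma^2) \tag{60}$$
$$= (1-R)^+, \tag{61}$$

where $x^+ = \max\{0,x\}$. Moreover, from (53) - (55) we obtain the worst-case and asymptotic noise sensitivity functions for the Gaussian input as follows:

$$\zeta^*(X_G,R) = \begin{cases} \exp\left(-R\,h\left(\frac{1}{R}\right)\right) & R \ge 1, \\ \infty & R < 1, \end{cases} \tag{62}$$

$$\xi^*(X_G,R) = \begin{cases} 0 & R > 1, \\ 1 & R = 1, \\ \infty & R < 1, \end{cases} \tag{63}$$

and

$$\zeta_L^*(X_G,R) = \xi_L^*(X_G,R) = \begin{cases} \frac{1}{R} & R \ge 1, \\ \infty & R < 1, \end{cases} \tag{64}$$

$$\zeta_L(X_G,R) = \xi_L(X_G,R) = \begin{cases} \frac{1}{R-1} & R > 1, \\ \infty & R \le 1. \end{cases} \tag{65}$$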

The worst-case noise sensitivity functions are plotted in Fig. 5 against the measurement rate $R$. Note that (63) provides an example for Remark 13: for Gaussian input and $R > 1$, the asymptotic noise sensitivity for optimal coding is zero, while the worst-case noise sensitivity is always strictly positive.
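The gap between worst-case and asymptotic sensitivity for optimal encoding can be checked numerically. The sketch below assumes $D^*(X_G,R,\sigma^2) = (1+\sigma^{-2})^{-R}$ and compares a brute-force supremum of $D^*/\sigma^2$ over a logarithmic grid of $\sigma^2$ with the expression $\exp(-R\,h(1/R))$, where $h$ is taken to be the binary entropy in nats. The grid bounds and function names are illustrative choices.

```python
import math

def d_opt(R, s2):
    """Assumed Gaussian distortion for optimal encoding: (1 + 1/s2)^(-R)."""
    return (1.0 + 1.0 / s2) ** (-R)

def zeta_closed(R):
    """exp(-R h(1/R)) for R >= 1 (h = binary entropy in nats), infinity otherwise."""
    if R < 1.0:
        return math.inf
    if R == 1.0:
        return 1.0
    p = 1.0 / R
    h = -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)
    return math.exp(-R * h)

def zeta_grid(R, lo=-8.0, hi=8.0, steps=4001):
    """Largest value of D*/sigma^2 over sigma^2 = 10^lo ... 10^hi."""
    best = 0.0
    for i in range(steps):
        s2 = 10.0 ** (lo + (hi - lo) * i / (steps - 1))
        best = max(best, d_opt(R, s2) / s2)
    return best

if __name__ == "__main__":
    for R in (1.5, 2.0, 3.0):
        print(R, zeta_grid(R), zeta_closed(R))
```

For $R > 1$ the maximizing noise level is $\sigma^2 = R - 1$, strictly away from zero, which is why the supremum is positive even though the limit as $\sigma^2 \to 0$ vanishes.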

Figure 5: Worst-case noise sensitivity $\zeta^*$, $\zeta_L^*$ and $\zeta_L$ for the Gaussian input, which all become infinity when $R < 1$ (the unstable regime).



In view of (63) - (65), the phase-transition thresholds in the Gaussian signal case are:

$$\mathcal{R}^*(X_G) = \mathcal{R}_L^*(X_G) = \mathcal{R}_L(X_G) = 1. \tag{66}$$

The equality of the three phase-transition thresholds turns out to hold well beyond the Gaussian signal model. In the next subsection, we formulate and prove the existence of the phase thresholds for all three distortion-rate functions and discrete-continuous mixtures, which turn out to be equal to the information dimension of the input distribution.



4.5 Non-Gaussian inputs 

This subsection contains our main results, which show that the phase transition thresholds are 
equal to the information dimension of the input, under rather general conditions. Therefore, the 
optimality of random sensing matrices in terms of the worst-case sensitivity observed in Section 4.4 
carries over well beyond the Gaussian case. Proofs are deferred to Section 6.3.

The phase transition threshold for optimal encoding is given by the upper information dimension of the input:

Theorem 9. For any $X$ that satisfies (5),

$$\mathcal{R}^*(X) = \bar{d}(X). \tag{67}$$

Moreover, if $P_X$ is a discrete-continuous mixture as in (3), then for any $R > \gamma$, as $\sigma \to 0$,

exp(2//(Pd)V-2P(Pc 



D*{X, R, a') 



2(1-7) 

(1-7) ^ 7 



(J -< (l + o(l)) 



(68) 



where $D(\cdot)$ denotes the non-Gaussianness of a probability measure, defined as its relative entropy with respect to a Gaussian distribution with the same mean and variance. Consequently, the asymptotic noise sensitivity of optimal encoding is



e{x,R) 



oo R < J 

exp(2/f(Pd)V-2»(Pc)) 



2(1-7) 

(1-7) T 7 



i? = 7 
-R > 7. 



(69) 



The next result shows that random linear encoders with i.i.d. Gaussian coefficients also achieve 
information dimension for any discrete-continuous mixtures, which, in view of Theorem 9, implies 
that, at least asymptotically, (random) linear encoding suffices for robust reconstruction as long as 
the input distribution contains no singular component. 

Theorem 10. Assume that $X$ has a discrete-continuous mixed distribution as in (3), where the discrete component $P_d$ has finite entropy. Then

$$\mathcal{R}^*(X) = \mathcal{R}_L^*(X) = \mathcal{R}_L(X) = \gamma. \tag{70}$$

Moreover,

1. (70) holds for any non-Gaussian noise distribution with finite non-Gaussianness.

2. For any $R > \gamma$, the worst-case noise sensitivity of Gaussian sensing matrices is upper bounded by
R' (R\^< / 2g(Pd)(l-7) + 2^(7) , ,\ 



Remark 14. The achievability proof of $\mathcal{R}_L(X)$ is a direct application of Theorem 6, where the Lipschitz decompressor in the noiseless case is used as a suboptimal estimator in the noisy case. The outline of the argument is as follows: suppose that we have obtained a sequence of linear encoders and $L(R)$-Lipschitz decoders $\{(A_n, g_n)\}$ with rate $R$ and error probability $\epsilon_n \to 0$ as $n \to \infty$. Then



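$$\mathbb{E}\left[\|g_n(A_n X^n + \sigma N^k) - X^n\|^2\right] \le L^2(R)\,\sigma^2\,\mathbb{E}\left[\|N^k\|^2\right] + \epsilon_n = k L^2(R)\,\sigma^2\,\mathrm{var}N + \epsilon_n, \tag{72}$$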



which implies that robust reconstruction is achievable at rate $R$ and the worst-case noise sensitivity is upper bounded by $L^2(R)R$.

Notice that the above achievability approach applies to any noise with finite variance, without requiring that the noise be additive, memoryless or that it have a density. In contrast, replica-based results rely crucially on the fact that the additive noise is memoryless Gaussian. Of course, in order for the converse (via $\mathcal{R}^*$) to hold, the non-Gaussian noise needs to have finite non-Gaussianness. The disadvantage of this approach is that currently it lacks an explicit construction because the extendability of Lipschitz functions (Kirszbraun's theorem) is only an existence result which relies on the Hausdorff maximal principle [40, Theorem 1.31, p. 21], which is equivalent to the axiom of choice. On Euclidean spaces it is possible to obtain an explicit construction by applying the results in [50, 51] to a countable dense subset of the domain. However, such a construction is far from being practical.

Remark 15. We emphasize the following "universality" aspects of Theorem 10: 

• Gaussian random sensing matrices achieve the optimal transition threshold for any discrete- 
continuous mixture, as long as it is known at the decoder; 

• The fundamental limit depends on the input statistics only through the weight on the analog component, regardless of the specific discrete and continuous components. In the conventional sparsity model (2) where $P_X$ is the mixture of an absolutely continuous distribution and a mass of $1-\gamma$ at 0, the fundamental limit is $\gamma$;

• The suboptimal estimator used in the achievability proof comes from the noiseless Lipschitz decoder, which does not depend on the noise distribution, or even its variance;

• The conclusion holds for non-Gaussian noise as long as it has finite non-Gaussianness. 

4.6 Results relying on replica heuristics 

Based on the statistical-physics approach in [52, 53], the decoupling principle results in [53] were imported into the compressed sensing setting in [14] to postulate the following formula for $D_L(X,R,\sigma^2)$. Note that this result is based on replica heuristics currently lacking a rigorous justification.

Replica Symmetry Postulate ([14, Corollary 1, p. 5]).

$$D_L(X,R,\sigma^2) = \mathrm{mmse}(X, \eta R\sigma^{-2}), \tag{73}$$

where $0 < \eta < 1$ satisfies the following equation [14, (12) - (13), pp. 4-5]:

$$\frac{1}{\eta} = 1 + \frac{1}{\sigma^2}\,\mathrm{mmse}(X, \eta R\sigma^{-2}). \tag{74}$$

When (74) has more than one solution, $\eta$ is chosen to minimize the free energy

$$I\left(X; \sqrt{\eta R\sigma^{-2}}\,X + N\right) + \frac{R}{2}(\eta - 1 - \log\eta). \tag{75}$$



^In the notation of [14, (12)], 7 and efi correspond to Ra ^ and R in our formulation. 
In view of the I-MMSE relationship [54], the solutions to (74) are precisely the stationary points of the free energy (75) as a function of $\eta$. In fact it is possible for (74) to have arbitrarily many solutions. For an explicit example, see Remark 21 in Section 6.3.
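For the Gaussian input the postulate can be checked against Theorem 8 by direct iteration. The sketch below assumes the postulate reads $D_L = \mathrm{mmse}(X, \eta R\sigma^{-2})$ with $1/\eta = 1 + \sigma^{-2}\,\mathrm{mmse}(X, \eta R\sigma^{-2})$, and uses the standard fact $\mathrm{mmse}(X_G, \mathsf{snr}) = 1/(1+\mathsf{snr})$ for a unit-variance Gaussian. In this case the fixed point is unique, so no free-energy minimization is needed; the plain fixed-point iteration and the function names are illustrative choices, not the method of [14].

```python
import math

def mmse_gaussian(snr):
    """MMSE of a unit-variance Gaussian observed in Gaussian noise at the given SNR."""
    return 1.0 / (1.0 + snr)

def replica_distortion(R, sigma2, iters=2000):
    """Iterate eta <- 1 / (1 + mmse(eta*R/sigma2) / sigma2), starting from eta = 1,
    and return the distortion mmse(eta*R/sigma2) at the resulting eta."""
    eta = 1.0
    for _ in range(iters):
        eta = 1.0 / (1.0 + mmse_gaussian(eta * R / sigma2) / sigma2)
    return mmse_gaussian(eta * R / sigma2)

def closed_form(R, sigma2):
    """Right-hand side of (55) for the Gaussian input."""
    disc = (1.0 - R) ** 2 + 2.0 * (1.0 + R) * sigma2 + sigma2 ** 2
    return 0.5 * (1.0 - R - sigma2 + math.sqrt(disc))

if __name__ == "__main__":
    for R, s2 in ((0.5, 0.5), (2.0, 1.0), (3.0, 0.1)):
        print(R, s2, replica_distortion(R, s2), closed_form(R, s2))
```

Eliminating $\eta$ by hand gives the quadratic $R\eta^2 + (1+\sigma^2-R)\eta - \sigma^2 = 0$, whose positive root reproduces (55), so the two printed columns should agree.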

Note that the solution in (73) does not depend on the distribution of the random measurement matrix $A$, as long as its entries are i.i.d. with zero mean and variance $\frac{1}{n}$. Therefore it is possible to employ a random sparse measurement matrix so that each encoding operation involves only a relatively few signal components, for example,

$$A_{ij} \sim \frac{p}{2}\,\delta_{\frac{1}{\sqrt{pn}}} + (1-p)\,\delta_0 + \frac{p}{2}\,\delta_{-\frac{1}{\sqrt{pn}}} \tag{76}$$

for some < p < 1. In fact, in the special case of p = -^^^j the replica symmetry postulate can be 
rigorously proved [14, Sec. IV] (see also [55, 56]). 

Assuming the validity of the replica symmetry postulate, it can be shown that the phase transition threshold for random linear encoding is always sandwiched between the lower and the upper MMSE dimension of the input. The relationship between the MMSE dimension and the information dimension in (18) plays a key role in analyzing the minimizer of the free energy (75).$^{10}$

Theorem 11. Assume that the replica symmetry postulate holds for $X$. Then for any i.i.d. random measurement matrix $A$ whose entries have zero mean and variance $\frac{1}{n}$,

$$\underline{\mathscr{D}}(X) \le \mathcal{R}_L(X) \le \overline{\mathscr{D}}(X). \tag{77}$$

Therefore if $\mathscr{D}(X)$ exists, we have

$$\mathcal{R}_L(X) = \mathscr{D}(X) = d(X), \tag{78}$$

and in addition,

$$D_L(X,R,\sigma^2) = \frac{d(X)}{R-d(X)}\,\sigma^2(1+o(1)). \tag{79}$$



The general result in Theorem 11 holds for any input distribution but relies on the conjectured validity of the replica symmetry postulate. For the special case of discrete-continuous mixtures in (3), in view of Theorem 2, Theorem 11 predicts (with the caveat of the validity of the replica symmetry postulate) that the phase-transition threshold for Gaussian sensing matrices is $\gamma$, which agrees with the rigorously proven result in Theorem 10. Therefore, the only added benefit of Theorem 11 is to allow singular components in the input distribution.

Remark 16. In statistical physics, the phase transition near the threshold often behaves according to a power law with certain universal exponent, known as the critical exponent [57, Chapter 3]. According to (79), as the measurement rate $R$ approaches the fundamental limit $d(X)$, the replica method suggests that the optimal noise sensitivity blows up according to the power law $\frac{1}{R-d(X)}$, where the unit exponent holds universally for all mixture distributions. It remains an open question whether this power law behavior can be rigorously proven and whether the optimal exponent is one. Note that by using the Lipschitz extension scheme in the proof of Theorem 10, we can achieve the noise sensitivity in (71), which blows up exponentially as $R - d(X)$ vanishes and is likely to be highly suboptimal.

$^{10}$It can be shown that in the limit of $\sigma^2 \to 0$, the minimizer of (75) when $R > \overline{\mathscr{D}}(X)$ and $R < \underline{\mathscr{D}}(X)$ corresponds to the largest and the smallest root of the fixed-point equation (74) respectively.

Remark 17. In fact, the proof of Theorem 11 shows that the converse part (left inequality) of (77) holds in a much stronger sense: as long as there is no residual error in the weak-noise limit, that is, if $D_L(X,R,\sigma^2) = o(1)$ as $\sigma^2 \to 0$, then $R \ge \underline{\mathscr{D}}(X)$ must hold. Therefore, the converse part of Theorem 11 still holds even if we weaken the right-hand side of (52) from $O(\sigma^2)$ to $o(1)$.

Remark 18. Assume the validity of the replica symmetry postulate. Combining Theorem 9, Theorem 11 and (50) gives an operational proof for $\bar{d}(X) \le \overline{\mathscr{D}}(X)$, the fourth inequality in (18), which has been proven analytically in [29, Theorem 8].

5 Comparisons to LASSO and AMP algorithms 

Widely popular in the compressed sensing literature, the LASSO [7, 6] and the approximate 
message passing (AMP) algorithms [9] are low-complexity reconstruction procedures, which are 
originally obtained as solutions to the conventional minimax setup in compressed sensing. In this 
section, we compare the phase transition thresholds of LASSO and AMP achieved in the Bayesian 
setting to the optimal thresholds derived in Sections 3-4. Similar Bayesian analysis has been 
performed in [25, 58, 9, 59, 60]. 

5.1 Signal models 

The following three families of input distributions are considered [9, p. 18915], indexed by $\chi = \pm$, $+$ and $\square$ respectively, which all belong to the family of input distributions of the mixture form in (3):

$\pm$ : sparse signals (2);

$+$ : sparse non-negative signals (2) with the continuous component $P_c$ supported on $\mathbb{R}_+$;

$\square$ : simple signals (Section 1.3) [25, Section 5.2, p. 540]

$$P = (1-\gamma)\left(\frac{1}{2}\delta_0 + \frac{1}{2}\delta_1\right) + \gamma P_c \tag{80}$$

where $P_c$ is some absolutely continuous distribution supported on the unit interval.

5.2 Noiseless measurements 

In the noiseless case, we consider linear programming (LP) decoders and the AMP decoder [9] 
and the phase transition threshold of error probability. Phase transitions of greedy reconstruction 
algorithms have been analyzed in [61], which derived upper bounds (achievability results) for the 
transition threshold of measurement rate. We focus our comparison on algorithms whose phase 
transition thresholds are known exactly. 

The following LP decoders are tailored to the three input distributions $\chi = \pm$, $+$ and $\square$ respectively (see Equations (P1), (LP) and (Feas) in [27, Section I]):

$$g_\pm(y) = \arg\min\{\|x\|_1 : x \in \mathbb{R}^n, Ax = y\}, \tag{81}$$
$$g_+(y) = \arg\min\{\|x\|_1 : x \in \mathbb{R}_+^n, Ax = y\}, \tag{82}$$
$$g_\square(y) = \{x : x \in [0,1]^n, Ax = y\}. \tag{83}$$

For sparse signals, (81) - (82) are based on $\ell_1$-minimization (also known as Basis Pursuit [6]), which is the noiseless limit of LASSO defined in Section 5.3, while for simple signals, the decoder (83) solves an LP feasibility problem. In general the decoders in (81) - (83) output a list of vectors upon receiving the measurement. The reconstruction is successful if and only if the output list contains only the true vector. The error probability is thus defined as $\mathbb{P}\{g_\chi(AX^n) \ne \{X^n\}\}$, evaluated with respect to the product measure $(P_X)^n \times P_A$.

The phase transition thresholds of the reconstruction error probability for decoders (81) - (83) are derived in [11] using combinatorial geometry. For sparse signals and $\ell_1$-minimization decoders (81) - (82), the expressions of the corresponding thresholds $\mathsf{R}_\pm(\gamma)$ and $\mathsf{R}_+(\gamma)$ are quite involved, given implicitly in [11, Definition 2.3]. As observed in [9, Finding 1], $\mathsf{R}_\pm(\gamma)$ and $\mathsf{R}_+(\gamma)$ agree numerically with the following expressions:$^{11}$

$$\mathsf{R}_\pm(\gamma) = \min_{\alpha \ge 0}\ \gamma(1+\alpha^2) + 2(1-\gamma)\left((1+\alpha^2)\Phi(-\alpha) - \alpha\varphi(\alpha)\right) \tag{84}$$
$$\mathsf{R}_+(\gamma) = \min_{\alpha \ge 0}\ \gamma(1+\alpha^2) + (1-\gamma)\left((1+\alpha^2)\Phi(-\alpha) - \alpha\varphi(\alpha)\right) \tag{85}$$

which is now rigorously established in view of the results in [63]. For simple signals, the phase transition threshold is proved to be [25, Theorem 1.1]

$$\mathsf{R}_\square(\gamma) = \frac{1+\gamma}{2}. \tag{86}$$

Moreover, substantial numerical evidence in [9] suggests that the phase transition thresholds for the AMP decoder coincide with the LP thresholds for all three input distributions. The suboptimal thresholds obtained from (84) - (86) are plotted in Fig. 6 along with the optimal threshold obtained from Theorem 6, which is $\gamma$.$^{12}$ In the gray area below the diagonal in the $(\gamma, R)$-phase diagram, any sequence of sensing matrices and decompressors will fail to reconstruct the true signal with probability that tends to one. Moreover, we observe that the LP and AMP decoders are severely suboptimal unless $\gamma$ is close to one.
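The thresholds for sparse signals are easy to evaluate numerically. The sketch below assumes (84) and (85) are minimizations over $\alpha \ge 0$ of $\gamma(1+\alpha^2) + w(1-\gamma)\big((1+\alpha^2)\Phi(-\alpha) - \alpha\varphi(\alpha)\big)$ with $w = 2$ and $w = 1$ respectively, where $\varphi$ and $\Phi$ are the standard normal density and distribution function. The grid search, its range, and the function names are illustrative choices.

```python
import math

def _phi(a):
    """Standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def _upper_tail(a):
    """Phi(-a) = P(Z > a), computed with erfc for accuracy at large a."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def lp_threshold(gamma, weight, alpha_max=10.0, steps=10000):
    """Grid minimization over alpha in [0, alpha_max].
    weight=2 corresponds to sparse signals, weight=1 to sparse non-negative signals."""
    best = math.inf
    for i in range(steps + 1):
        a = alpha_max * i / steps
        tail = (1.0 + a * a) * _upper_tail(a) - a * _phi(a)
        best = min(best, gamma * (1.0 + a * a) + weight * (1.0 - gamma) * tail)
    return best

if __name__ == "__main__":
    for gamma in (0.01, 0.1, 0.3, 0.5):
        print(gamma, lp_threshold(gamma, 2.0), lp_threshold(gamma, 1.0))
```

At $\gamma = 0.1$ the first threshold evaluates to roughly 0.33, about 3.3 times the optimal threshold $\gamma$.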

In the highly sparse regime which is most relevant to compressed sensing problems, it follows from [27, Theorem 3] that for sparse signals ($\chi = \pm$ or $+$),

$$\mathsf{R}_\chi(\gamma) = 2\gamma\log_e\frac{1}{\gamma}\,(1+o(1)), \quad \text{as } \gamma \to 0, \tag{87}$$

which implies that $\mathsf{R}_\chi$ has infinite slope at $\gamma = 0$. Therefore when $\gamma \ll 1$, the $\ell_1$ and AMP decoders require on the order of $2s\log_e\frac{n}{s}$ measurements to successfully recover the unknown vector, whose number of nonzero components is denoted by $s$. In contrast, $s$ measurements suffice when using an optimal decoder (or $\ell_0$-minimization decoder). The LP or AMP decoders are also highly suboptimal for simple signals, since $\mathsf{R}_\square(\gamma)$ converges to $\frac{1}{2}$ instead of zero as $\gamma \to 0$. This suboptimality is due to the fact that the LP feasibility decoder (83) simply finds any $x^n$ in the hypercube $[0,1]^n$ that is compatible with the linear measurements. Such a decoding strategy does not enforce the typical discrete structure of the signal, since most of the entries saturate at 0 or 1 equiprobably. Alternatively, the following decoder achieves the optimal $\gamma$: define

$$T(x^n) = \frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{1}\{x_i = 0\},\ \mathbf{1}\{x_i \notin \{0,1\}\},\ \mathbf{1}\{x_i = 1\}\right).$$

The decoder outputs the solution to $Ax^n = y^k$ such that $T(x^n)$ is closest to $\left(\frac{1-\gamma}{2}, \gamma, \frac{1-\gamma}{2}\right)$ (in total variation distance for example).
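The type-matching rule can be illustrated in a few lines. The sketch below computes the empirical type $T(x^n)$, its total variation distance to the target $\left(\frac{1-\gamma}{2}, \gamma, \frac{1-\gamma}{2}\right)$ implied by (80), and picks the closest vector from a finite candidate list. How the candidate solutions of $Ax^n = y^k$ are enumerated is not addressed here; the candidate list, the numerical tolerance, and all function names are hypothetical.

```python
def empirical_type(x, tol=1e-9):
    """Fractions of entries equal to 0, strictly inside/outside {0, 1}, and equal to 1."""
    n = len(x)
    zeros = sum(1 for v in x if abs(v) <= tol)
    ones = sum(1 for v in x if abs(v - 1.0) <= tol)
    return (zeros / n, (n - zeros - ones) / n, ones / n)

def tv_distance(p, q):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def target_type(gamma):
    """Expected type of a simple signal with mixture weight gamma, per (80)."""
    return ((1.0 - gamma) / 2.0, gamma, (1.0 - gamma) / 2.0)

def pick_closest(candidates, gamma):
    """Among candidate solutions, return the one whose type is closest to the target."""
    goal = target_type(gamma)
    return min(candidates, key=lambda x: tv_distance(empirical_type(x), goal))

if __name__ == "__main__":
    cands = [[0.0, 1.0, 0.5, 0.5], [0.0, 1.0, 0.0, 0.5]]
    print(pick_closest(cands, 0.5))
```

A generic point of the feasible polytope has many entries strictly inside $(0,1)$, so its type is far from the target when $\gamma$ is small; this is the structural information that the feasibility decoder (83) ignores.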

^^In the series of papers [25, 27, 9, 13], the phase diagrams are parameterized by (p, 5), where 5 = R is the 
measurement rate and p = ^ is the ratio between the sparsity and rate. In this paper, the parameterization (7, R) 
is used instead. The ratio -^-^ is denoted by p{j;x) in [9]- The same parametrization is also used in [62]. 

$^{12}$A similar comparison between the suboptimal threshold $\mathsf{R}_\pm(\gamma)$ and the optimal threshold $\gamma$ has been provided in [17, Fig. 2(a)] based on a replica-heuristic calculation.

5.3 Noisy measurements

In the noisy case, we consider the AMP decoder [13] and the $\ell_1$-penalized least-squares (i.e. LASSO) decoder [7]:

$$g(y, A; \lambda) = \arg\min_{x \in \mathbb{R}^n}\ \frac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1, \tag{88}$$

where $\lambda > 0$ is a regularization parameter. Note that in the limit of $\lambda \to 0$, LASSO reduces to the $\ell_1$-minimization decoder defined in (81). For Gaussian sensing matrices and Gaussian observation noise, the asymptotic mean-square error achieved by LASSO for a fixed $\lambda$

$$D^{(\lambda)}(X,R,\sigma^2) \triangleq \lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\|X^n - g(AX^n + \sigma N^k, A; \lambda)\|^2\right] \tag{89}$$

can be determined as a function of $P_X$, $\lambda$ and $\sigma$ by applying [59, Corollary 1.6].$^{13}$ In Appendix C, we show that for any $X$ distributed according to the mixture

$$P_X = (1-\gamma)\delta_0 + \gamma Q, \tag{90}$$

where $Q$ is an arbitrary probability measure with no mass at zero, the asymptotic noise sensitivity of LASSO with optimized $\lambda$ is given by the following equation:

$$\xi_{\mathrm{LASSO}}(X,R) \triangleq \inf_{\lambda}\lim_{\sigma\to 0}\frac{D^{(\lambda)}(X,R,\sigma^2)}{\sigma^2} \tag{91}$$
$$= \begin{cases} \frac{\mathsf{R}_\pm(\gamma)}{R - \mathsf{R}_\pm(\gamma)} & R > \mathsf{R}_\pm(\gamma), \\ \infty & R \le \mathsf{R}_\pm(\gamma), \end{cases} \tag{92}$$

where $\mathsf{R}_\pm(\gamma)$ is given in (84). By the same reasoning in Remark 13, the worst-case noise sensitivity of LASSO is finite if and only if $R > \mathsf{R}_\pm(\gamma)$. Note that (92) does not depend on $Q$ as long as $Q(\{0\}) = 0$. Therefore $\mathsf{R}_\pm(\gamma)$ also coincides with the phase transition threshold in a minimax sense, obtained in [13, Proposition 3.1(1.a)] by considering the least favorable $Q$. Analogously, the LASSO decoder (88) can be adapted to other signal structures (see for example [13, Sec. VI-A]), resulting in the phase-transition thresholds $\mathsf{R}_+(\gamma)$ and $\mathsf{R}_\square(\gamma)$ for sparse positive and simple signals, given by (85) and (86), respectively. Furthermore, these thresholds also apply to the AMP algorithm [64].

$^{13}$It should be noted that in [13, 59], the entries of the sensing matrix are distributed according to $\mathcal{N}(0, \frac{1}{k})$ (column normalization), while in the present paper the sensing matrix has $\mathcal{N}(0, \frac{1}{n})$ entries (row normalization) in order for the encoded signal to have unit average power. Therefore the expression in (92) is equal to that in [13, Equation (1.9)] divided by the measurement rate $R$.
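To make the LASSO objective concrete, here is a minimal sketch of iterative soft thresholding (ISTA) for $\frac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1$, written with plain Python lists. ISTA is a generic proximal-gradient method swapped in for illustration; it is neither the AMP algorithm of [13] nor a solver used in the paper. The step size is assumed to be at most $1/\|A\|^2$ (squared spectral norm), and all names are hypothetical.

```python
def soft_threshold(v, t):
    """Proximal map of t*|.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def lasso_ista(A, y, lam, step, iters=500):
    """Minimize 0.5*||y - A x||^2 + lam*||x||_1 by iterative soft thresholding.
    A is a list of k rows of length n; step should not exceed 1/||A||^2."""
    k, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Residual r = A x - y and gradient A^T r of the quadratic term.
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(k)]
        grad = [sum(A[i][j] * r[i] for i in range(k)) for j in range(n)]
        # Gradient step followed by coordinate-wise shrinkage.
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(n)]
    return x

if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(lasso_ista(identity, [3.0, -0.5, 1.2], lam=1.0, step=1.0))
```

With an identity sensing matrix the minimizer is the soft-thresholded observation, which gives a simple correctness check and shows how $\lambda$ trades bias on large entries for exact zeros on small ones.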

Next, focusing on sparse signals, we compare the performance of LASSO and AMP algorithms to the optimum. In view of (92), the phase transition thresholds of noise sensitivity for the LASSO and AMP decoder are both $\mathsf{R}_\pm(\gamma)$ for any $X$ distributed according to (90). We discuss the following two special cases:

1. $Q$ is absolutely continuous, or alternatively, $P_X$ is a discrete-continuous mixture given in (2). The optimal phase transition threshold is $\gamma$ as a consequence of Theorem 10. Therefore the phase transition boundaries are identical to Fig. 6 and the same observation in Section 5.2 applies.

2. Q is discrete with no mass at zero, e.g., Q = + Since Px is discrete with zero 
information dimension, the optimal phase transition threshold is equal to zero, while $\mathsf{R}_\pm(\gamma)$ still applies to LASSO and AMP.

For sparse signals of the form (2) with $\gamma = 0.1$, Fig. 7 compares those expressions for the asymptotic noise sensitivity of LASSO (and AMP) algorithm to the optimal noise sensitivity predicted by Theorem 11 based on replica heuristics. Note that the phase transition threshold of LASSO is approximately 3.3 times the optimal.

6 Proofs 

6.1 Auxiliary results 

We need the following large-deviations result on Gaussian random matrices. 

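Lemma 1. Let $\sigma_{\min}(B_k)$ denote the smallest singular value of the $k \times m_k$ matrix $B_k$ consisting of i.i.d. Gaussian entries with zero mean and variance $\frac{1}{k}$. For any $t > 0$, denote

$$F_{k,m_k}(t) \triangleq \mathbb{P}\{\sigma_{\min}(B_k) \le t\}. \tag{93}$$

Suppose that $\frac{m_k}{k} \to \alpha \in (0,1)$ as $k \to \infty$. Then

$$\liminf_{k\to\infty}\frac{1}{k}\log\frac{1}{F_{k,m_k}(t)} \ge \frac{1-\alpha}{2}\log\frac{(1-\alpha)^2}{e t^2} + \frac{\alpha}{2}\log\alpha. \tag{94}$$

Proof. For brevity let $H_k = \sqrt{k}B_k$ and suppress the dependence of $m_k$ on $k$. Then $H_k^T H_k$ is an $m \times m$ Gaussian Wishart matrix. The minimum eigenvalue of $H_k^T H_k$ has a density, which admits the following upper bound [65, Proposition 5.1, p. 553]:

$$f_{\lambda_{\min}(H_k^T H_k)}(x) \le E_{k,m}\,x^{\frac{k-m-1}{2}}e^{-\frac{x}{2}}, \quad x \ge 0, \tag{95}$$

where

$$E_{k,m} \triangleq \frac{\sqrt{\pi}\,2^{-\frac{k-m+1}{2}}\,\Gamma\left(\frac{k+1}{2}\right)}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(\frac{k-m+1}{2}\right)\Gamma\left(\frac{k-m+2}{2}\right)}. \tag{96}$$

Then

$$\mathbb{P}\{\sigma_{\min}(B_k) \le t\} = \mathbb{P}\{\lambda_{\min}(H_k^T H_k) \le k t^2\} \tag{97}$$
$$\le E_{k,m}\int_0^{kt^2} x^{\frac{k-m-1}{2}}e^{-\frac{x}{2}}\,dx \tag{98}$$
$$\le \frac{2E_{k,m}\,(kt^2)^{\frac{k-m+1}{2}}}{k-m+1}. \tag{99}$$

Applying Stirling's approximation to (99) yields (94). □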

Remark 19. A more general non-asymptotic upper bound on $\mathbb{P}\{\sigma_{\min}(B_k) \le t\}$ is given in [66, Theorem 1.1], which holds universally for all sub-Gaussian distributions. Note that the upper bound in [66, Equation (1.10)] is of the form

$$\mathbb{P}\{\sigma_{\min}(B_k) \le t\} \le (c_1 t)^{c_2 k} + \exp(-c_3 k) \tag{100}$$

where $c_1, c_2, c_3$ are constants. The second term in (100) is due to the fact that the least singular value for discrete ensembles (e.g. Rademacher) always has a mass at zero, which is exponentially small in $k$ but independent of $t$. For Gaussian ensembles, however, $\sigma_{\min}(B_k) > 0$ almost surely. Indeed, Lemma 1 indicates that the second term in (100) can be dropped, which provides a refinement of the general result in [66, Theorem 1.1] in the Gaussian case. As shown in Section 6.2, in order for the proof of Theorem 6 to work, it is necessary to use ensembles for which $\mathbb{P}\{\sigma_{\min}(B_k) \le t\}$ can be upper bounded asymptotically by $\exp(-kE(t))$, where $E(t) \to \infty$ as $t$ vanishes.

The next lemma upper bounds the probability that a Gaussian random matrix shrinks the 
length of some vector in an affine subspace by a constant factor. The point of this result is that 
the bound depends only on the dimension of the subspace but not the basis. 

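Lemma 2. Let $A$ be a $k \times n$ random matrix with i.i.d. Gaussian entries with zero mean and variance $\frac{1}{n}$. Let $R = \frac{k}{n}$. Let $k > m$. Then for any $m$-dimensional affine subspace $U$ of $\mathbb{R}^n$,

$$\mathbb{P}\left\{\inf_{x \in U\setminus\{0\}}\frac{\|Ax\|}{\|x\|} \le t\right\} \le F_{k,m+1}\left(R^{-\frac{1}{2}}t\right). \tag{101}$$

Proof. By definition, there exist $v \in \mathbb{R}^n$ and an $m$-dimensional linear subspace $V$ such that $U = v + V$. First assume that $v \notin V$. Then $0 \notin U$. Let $\{v_0, \ldots, v_m\}$ be an orthonormal basis for $V' = \mathrm{span}(v, V)$. Set $\mathbf{V}' = [v_0, \ldots, v_m]$. Then

$$\inf_{x \in U}\frac{\|Ax\|}{\|x\|} = \min_{x \in V'\setminus\{0\}}\frac{\|Ax\|}{\|x\|} \tag{102}$$
$$= \min_{y \in \mathbb{R}^{m+1}\setminus\{0\}}\frac{\|A\mathbf{V}'y\|}{\|y\|} \tag{103}$$
$$= \sigma_{\min}(A\mathbf{V}'), \tag{104}$$

where (102) is due to the following reasoning: since $U \subset V'$, it remains to establish $\inf_{x \in U}\frac{\|Ax\|}{\|x\|} \le \min_{x \in V'\setminus\{0\}}\frac{\|Ax\|}{\|x\|}$. To see this, for any $x \in V'$, we have $x = \alpha v + \beta y$ for some $\alpha, \beta \in \mathbb{R}$ and $y \in V$. Without loss of generality, we can assume that $\alpha \ge 0$. For each $\tau > 0$, define $x_\tau = (\alpha+\tau)v + \beta y \in V'$, which satisfies $\|x_\tau - x\| \to 0$ as $\tau \to 0$. Then $\frac{x_\tau}{\alpha+\tau} \in U$ and

$$\frac{\|Ax\|}{\|x\|} = \lim_{\tau\to 0}\frac{\left\|A\frac{x_\tau}{\alpha+\tau}\right\|}{\left\|\frac{x_\tau}{\alpha+\tau}\right\|} \ge \inf_{x \in U}\frac{\|Ax\|}{\|x\|}, \tag{105}$$

which, upon minimizing the left-hand side of (105) over $x \in V'\setminus\{0\}$, implies the desired (102). In view of (104), (101) holds with equality since $A\mathbf{V}'$ is a $k \times (m+1)$ random matrix with i.i.d. normal entries of zero mean and variance $\frac{1}{n}$.$^{14}$ If $v \in V$, then (101) holds with equality and $m+1$ replaced by $m$. The proof is then complete because $m \mapsto F_{k,m}(t)$ is increasing. □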

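Lemma 3. Let $\Gamma$ be a union of $N$ affine subspaces of $\mathbb{R}^n$ with dimension not exceeding $m$. Let $\mathbb{P}\{X^n \in \Gamma\} \ge 1-\epsilon$. Let $A$ be defined in Lemma 2 independent of $X^n$. Then

$$\mathbb{P}\left\{X^n \in \Gamma,\ \inf_{y \in \Gamma\setminus\{X^n\}}\frac{\|A(y - X^n)\|}{\|y - X^n\|} \ge t\right\} \ge 1-\epsilon', \tag{106}$$

where

$$\epsilon' = \epsilon + N F_{k,m+1}\left(R^{-\frac{1}{2}}t\right). \tag{107}$$

Moreover, there exists a subset $E \subset \mathbb{R}^{k \times n}$ with $\mathbb{P}\{A \in E\} \ge 1-\sqrt{\epsilon'}$, such that for any $K \in E$, there exists a Lipschitz continuous function $g_K : \mathbb{R}^k \to \mathbb{R}^n$ with $\mathrm{Lip}(g_K) \le t^{-1}$ and

$$\mathbb{P}\{g_K(KX^n) = X^n\} \ge 1-\sqrt{\epsilon'}. \tag{108}$$

$^{14}$Note that the entries in the ensemble in Lemma 1 have variance inversely proportional to the number of columns.

Proof. By the independence of $X^n$ and $A$,

$$\mathbb{P}\left\{X^n \in \Gamma,\ \inf_{y \in \Gamma\setminus\{X^n\}}\frac{\|A(y - X^n)\|}{\|y - X^n\|} \ge t\right\} = \int_\Gamma P_{X^n}(dx)\,\mathbb{P}\left\{\inf_{z \in (\Gamma - x)\setminus\{0\}}\frac{\|Az\|}{\|z\|} \ge t\right\} \tag{109}$$
$$\ge \mathbb{P}\{X^n \in \Gamma\}\left(1 - N F_{k,m+1}\left(R^{-\frac{1}{2}}t\right)\right) \tag{110}$$
$$\ge 1-\epsilon', \tag{111}$$

where (110) follows by applying Lemma 2 to each affine subspace in $\Gamma - x$ and the union bound. To prove (108), denote by $p(K)$ the probability in the left-hand side of (106) conditioned on the random matrix $A$ being equal to $K$. By Fubini's theorem and Markov's inequality,

$$\mathbb{P}\left\{p(A) \ge 1-\sqrt{\epsilon'}\right\} \ge 1-\sqrt{\epsilon'}. \tag{112}$$

Put $E = \{K : p(K) \ge 1-\sqrt{\epsilon'}\}$. For each $K \in E$, define

$$U_K = \left\{x \in \Gamma : \inf_{y \in \Gamma\setminus\{x\}}\frac{\|K(y-x)\|}{\|y-x\|} \ge t\right\} \subset \Gamma. \tag{113}$$

Then, for any $(x,y) \in U_K^2$, we have

$$\|K(x-y)\| \ge t\|x-y\|, \tag{114}$$

which implies that $K|_{U_K}$, the linear mapping $K$ restricted on the set $U_K$, is injective. Moreover, its inverse $g_K : K(U_K) \to U_K$ is $\frac{1}{t}$-Lipschitz. By Kirszbraun's theorem [38, 2.10.43], $g_K$ can be extended to a Lipschitz function on the whole space $\mathbb{R}^k$ with the same Lipschitz constant. For those $K \notin E$, set $g_K \equiv 0$. Since $\mathbb{P}\{X^n \in U_K\} \ge 1-\sqrt{\epsilon'}$ for all $K \in E$, we have

$$\mathbb{P}\{g_K(KX^n) = X^n\} \ge \mathbb{P}\{X^n \in U_K\} \ge 1-\sqrt{\epsilon'}, \tag{115}$$

completing the proof of the lemma. □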

6.2 Proofs of results in Section 3 

Proof of Theorem 5. To prove the left inequality in (24), denote

$$C = \{ f(x^n) \in \mathbb{R}^k : g(f(x^n)) = x^n \} \subset \mathbb{R}^k. \qquad (116)$$

Then

$$\begin{aligned}
k &\ge \dim_B(C) && (117) \\
&\ge \dim_B(g(C)) && (118) \\
&\ge \dim_B^{\epsilon}(P_{X^n}), && (119)
\end{aligned}$$

where

(117): Minkowski dimension never exceeds the ambient dimension;

(118): Minkowski dimension never increases under Lipschitz mapping [67, Exercise 7.6, p. 108];

(119): by $\mathbb{P}\{X^n \in g(C)\} \ge 1 - \epsilon$ and (20).

It remains to prove the right inequality in (24). By definition of $\dim_B^{\epsilon}$, for any $\delta > 0$, there
exists $E$ such that $P_{X^n}(E) \ge 1 - \epsilon$ and $\dim_B(E) \le \dim_B^{\epsilon}(P_{X^n}) + \delta$. Since $P_{X^n}$ can be written as a
convex combination of $P_{X^n | X^n \in E}$ and $P_{X^n | X^n \notin E}$, applying [12, Theorem 2] yields

$$\begin{aligned}
d(X^n) &\le d(P_{X^n | X^n \in E})\, P_{X^n}(E) + d(P_{X^n | X^n \notin E})\, (1 - P_{X^n}(E)) && (120) \\
&\le \dim_B^{\epsilon}(P_{X^n}) + \delta + \epsilon n, && (121)
\end{aligned}$$

where (121) holds because the information dimension of any distribution is upper bounded by the
Minkowski dimension of its support [35]. By the arbitrariness of $\delta$, the desired result follows. □

Proof of Theorem 6. Let $P_X$ be a discrete-continuous mixture as in (3). Equation (29) is proved
in [12, Theorem 6]. The achievability part follows from Theorem 4, since, with high probability, 
the input vector is concentrated on a finite union of affine subspaces whose Minkowski dimension is 
equal to the maximum dimension of those subspaces. The converse part is proved using Steinhaus' 
theorem [68]. 

It remains to establish the achievability part of (31): $R(X, \epsilon) \le \gamma$. Fix $R > \gamma$. Fix $\delta, \delta' > 0$
arbitrarily small. In view of Lemma 3, to prove the achievability of $R$, it suffices to show that, with
high probability, $X^n$ lies in the union of exponentially many affine subspaces whose dimensions do
not exceed $nR$.

To this end, let $W_i = 1_{\{X_i \notin \mathcal{A}\}}$, where $\mathcal{A}$ denotes the collection of all atoms of $P_d$, which is, by
definition, a countable subset of $\mathbb{R}$. Then $\{W_i\}$ is a sequence of i.i.d. binary random variables with
expectation $\gamma$. By the weak law of large numbers,

$$\frac{1}{n} |\mathrm{spt}(X^n)| = \frac{1}{n} \sum_{i=1}^{n} W_i \xrightarrow{\ \mathbb{P}\ } \gamma, \qquad (122)$$

where the generalized support of $x^n$ is defined as

$$\mathrm{spt}(x^n) = \{ i = 1, \ldots, n : x_i \notin \mathcal{A} \}. \qquad (123)$$
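The convergence in (122) can be illustrated numerically. The following is a minimal sketch, not part of the original argument: it assumes a hypothetical mixture with a single atom at 0 of weight $1 - \gamma$ and a standard Gaussian continuous part, and the function names are illustrative only.

```python
import random

def generalized_support(x, atoms):
    """Indices i (1-based) with x_i outside the atom set, as in (123)."""
    return [i for i, xi in enumerate(x, start=1) if xi not in atoms]

def sample_mixture(n, gamma, rng):
    """Draw n i.i.d. samples: N(0,1) with probability gamma, else the atom 0.0.

    The single atom at zero is an assumed example of the discrete part P_d.
    """
    return [rng.gauss(0.0, 1.0) if rng.random() < gamma else 0.0 for _ in range(n)]

rng = random.Random(0)
gamma, n = 0.3, 20000
x = sample_mixture(n, gamma, rng)
# Empirical fraction |spt(X^n)| / n, which (122) says concentrates around gamma.
frac = len(generalized_support(x, atoms={0.0})) / n
print(frac)
```

The printed fraction should be close to the weight $\gamma$ of the continuous component, which is the quantity that later controls the dimension of the affine subspaces.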

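For each $k \ge 1$, define

$$T_k = \left\{ z^k \in \mathcal{A}^k : \frac{1}{k} \sum_{i=1}^{k} \log \frac{1}{P_d(z_i)} \le H(P_d) + \delta' \right\}. \qquad (124)$$

Since $H(P_d) < \infty$, we have $|T_k| \le \exp((H(P_d) + \delta') k)$. Moreover, $P_d^k(T_k) \ge 1 - \epsilon$ for all sufficiently
large $k$, by the weak law of large numbers.

Let $t(x^n)$ denote the discrete part of $x^n$, i.e., the vector formed by those $x_i \in \mathcal{A}$ in increasing
order of $i$. Then $t(x^n) \in \mathcal{A}^{n - |\mathrm{spt}(x^n)|}$. Let

$$\begin{aligned}
C_n &= \big\{ x^n \in \mathbb{R}^n : \big| |\mathrm{spt}(x^n)| - \gamma n \big| \le \delta n,\ t(x^n) \in T_{n - |\mathrm{spt}(x^n)|} \big\} && (125) \\
&= \bigcup_{\substack{S \subset \{1, \ldots, n\} \\ | |S| - \gamma n | \le \delta n}} \ \bigcup_{z \in T_{n - |S|}} \{ x^n \in \mathbb{R}^n : \mathrm{spt}(x^n) = S,\ t(x^n) = z \}. && (126)
\end{aligned}$$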
\\S\-'rn\<Sn 
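Note that each of the subsets in the right-hand side of (126) is an affine subspace of dimension no
more than $(\gamma + \delta) n$. Therefore $C_n$ consists of $N_n$ affine subspaces, with

$$\begin{aligned}
N_n &\le \sum_{k = \lfloor (\gamma - \delta) n \rfloor}^{\lceil (\gamma + \delta) n \rceil} \binom{n}{k} |T_{n-k}| && (127) \\
&\le \sum_{k = \lfloor (\gamma - \delta) n \rfloor}^{\lceil (\gamma + \delta) n \rceil} \binom{n}{k} \exp\big( (H(P_d) + \delta')(n - k) \big), && (128)
\end{aligned}$$

hence

$$\limsup_{n \to \infty} \frac{1}{n} \log N_n \le (H(P_d) + \delta')(1 - \gamma + \delta) + \max\{ h(\gamma + \delta), h(\gamma - \delta) \}. \qquad (129)$$

Moreover, by (122), for sufficiently large $n$,

$$\begin{aligned}
\mathbb{P}\{X^n \in C_n\} &= \sum_{| |S| - \gamma n | \le \delta n} \mathbb{P}\{ X^n \in C_n,\ \mathrm{spt}(X^n) = S \} && (130) \\
&= \sum_{| |S| - \gamma n | \le \delta n} \mathbb{P}\{ \mathrm{spt}(X^n) = S \}\, P_d^{\,n - |S|}(T_{n - |S|}) && (131) \\
&\ge \mathbb{P}\big\{ \big| |\mathrm{spt}(X^n)| - \gamma n \big| \le \delta n \big\} (1 - \epsilon) && (132) \\
&\ge 1 - 2\epsilon. && (133)
\end{aligned}$$

To apply Lemma 2, it remains to select a sufficiently small but fixed $t$, such that

$$N_n F_{Rn, (\gamma + \delta) n}\big( R^{-\frac{1}{2}} t \big) = o(1) \qquad (134)$$

as $n \to \infty$. This is always possible, in view of (129) and Lemma 1, by choosing $t > 0$ sufficiently
small such that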

log ^^^J,"^' + ^ loga > {H{Pi) + 6'){l-j + 6) + max{/i(7 + 6), h{j - 6)}, (135) 

where a = By the arbitrariness of 6 and S', the proof of R{X,e) < 7 is complete. Finally, 

by Theorem 5, the Lipschitz constant of the corresponding decoder is upper bounded by $\frac{1}{t}$, which,
according to (135), can be chosen arbitrarily close to the right-hand side of (30) by sending both $\delta$
and $\delta'$ to zero, completing the proof of (30). □

6.3 Proofs of results in Section 4 

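Proof of Theorem 9. The proof of (49) is based on the low-distortion asymptotics of $R_X(D)$ [32]:

$$\limsup_{D \downarrow 0} \frac{R_X(D)}{\frac{1}{2} \log \frac{1}{D}} = \bar{d}(X). \qquad (136)$$

Converse: Fix $R > \mathcal{R}^*(X)$. By definition, there exists $a > 0$ such that $D^*(X, R, \sigma^2) \le a \sigma^2$ for
all $\sigma^2 > 0$. By (40) and the monotonicity of $R_X(\cdot)$,

$$R_X(a \sigma^2) \le \frac{R}{2} \log(1 + \sigma^{-2}). \qquad (137)$$

Dividing both sides by $\frac{1}{2} \log \frac{1}{\sigma^2}$ and taking $\limsup_{\sigma^2 \to 0}$, we obtain $R \ge \bar{d}(X)$ in view of (136). By
the arbitrariness of $R$, we have $\mathcal{R}^*(X) \ge \bar{d}(X)$.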
Achievability: Fix $\delta > 0$ arbitrarily and let $R = \bar{d}(X) + 2\delta$. We show that $R \ge \mathcal{R}^*(X)$, i.e., the
worst-case noise sensitivity is finite. By Remark 13, this is equivalent to achieving (52). By (136),
there exists $D_0 > 0$ such that for all $D < D_0$,

$$R_X(D) \le \frac{\bar{d}(X) + \delta}{2} \log \frac{1}{D}. \qquad (138)$$

By Theorem 7, $D^*(X, R, \sigma^2) \downarrow 0$ as $\sigma^2 \downarrow 0$. Therefore there exists $\sigma_0^2 > 0$, such that $D^*(X, R, \sigma^2) < D_0$ for all $\sigma^2 < \sigma_0^2$. In view of (40) and (138), we have

$$\frac{R}{2} \log\left(1 + \frac{1}{\sigma^2}\right) = R_X(D^*(X, R, \sigma^2)) \le \frac{\bar{d}(X) + \delta}{2} \log \frac{1}{D^*(X, R, \sigma^2)}, \qquad (139)$$

i.e.,

$$D^*(X, R, \sigma^2) \le \sigma^{\,2\, \frac{\bar{d}(X) + 2\delta}{\bar{d}(X) + \delta}} \qquad (140)$$

holds for all $\sigma^2 < \sigma_0^2$. This obviously implies the desired (52).

We finish the proof by proving (68) and (69). The low-distortion asymptotic expansion of the
rate-distortion function of a discrete-continuous mixture with mean-square error distortion is found
in [69, Corollary 1], which refines (10): as $D \downarrow 0$,



+ /i(7) + (1 - -f)H{Pd) + -fh{P,) + 0(1) 



2 Tva^ ^ ^^^^ ^ _ ^^^^p^^ _ ^^^p^^ ^ ^^^^ 



(141) 
(142) 



where $P_X$ is given by (3). Actually (142) has a natural interpretation: first encode losslessly the
i.i.d. Bernoulli sequence {A, D, D, D, A, . . .}, where D and A indicate the source realization is in the
discrete alphabet or not, respectively. Then use lossless and lossy optimal encoding of $P_d$ and $P_c$
for the discrete and continuous symbols respectively. What is interesting is that this strategy turns
out to be optimal for low distortion. Plugging (142) into (40) gives (68), which implies (69) as a
direct consequence. □

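Proof of Theorem 10. Let $R > \gamma$. We show that the worst-case noise sensitivity $\zeta_L(X, R)$ under
Gaussian random sensing matrices is finite. We construct a suboptimal estimator based on the Lipschitz
decoder in Theorem 6. Let $A_n$ be a $k \times n$ Gaussian sensing matrix and $g_{A_n}$ the corresponding
$L_R$-Lipschitz decoder, such that $k = Rn$ and $\mathbb{P}\{E_n\} = o(1)$, where $E_n = \{ g_{A_n}(A_n X^n) \ne X^n \}$
denotes the error event. Without loss of generality, we assume that $g_{A_n}(0) = 0$. Fix $\tau > 0$. Then

$$\begin{aligned}
&\mathbb{E}\big[ \| g_{A_n}(A_n X^n + \sigma N^k) - X^n \|^2 \big] \\
&\quad \le \mathbb{E}\big[ \| g_{A_n}(A_n X^n + \sigma N^k) - X^n \|^2\, 1_{E_n^c} \big] + 2 L_R^2\, \mathbb{E}\big[ \| A_n X^n + \sigma N^k \|^2\, 1_{E_n} \big] + 2\, \mathbb{E}\big[ \| X^n \|^2\, 1_{E_n} \big] && (143) \\
&\quad \le k L_R^2 \sigma^2 + \tau n (2 L_R^2 + 2)\, \mathbb{P}\{E_n\} + 2\, \mathbb{E}\big[ \| X^n \|^2\, 1_{\{ \|X^n\|^2 > \tau n \}} \big] \\
&\qquad + 2 L_R^2\, \mathbb{E}\big[ \| A_n X^n + \sigma N^k \|^2\, 1_{\{ \| A_n X^n + \sigma N^k \|^2 > \tau n \}} \big]. && (144)
\end{aligned}$$

Here the first term in (144) uses the fact that $g_{A_n}(A_n X^n) = X^n$ on $E_n^c$ together with the Lipschitz property, and the remaining terms follow by splitting each expectation in (143) according to whether the squared norm exceeds $\tau n$.

Dividing both sides of (144) by $n$ and sending $n \to \infty$, we have: for any $\tau > 0$,

$$\begin{aligned}
\limsup_{n \to \infty} \frac{1}{n} \mathbb{E}\big[ \| g_{A_n}(A_n X^n + \sigma N^k) - X^n \|^2 \big]
&\le R L_R^2 \sigma^2 + 2 \sup_n \frac{1}{n} \mathbb{E}\big[ \| X^n \|^2\, 1_{\{ \|X^n\|^2 > \tau n \}} \big] \\
&\quad + 2 L_R^2 \sup_n \frac{1}{n} \mathbb{E}\big[ \| A_n X^n + \sigma N^k \|^2\, 1_{\{ \| A_n X^n + \sigma N^k \|^2 > \tau n \}} \big]. && (145)
\end{aligned}$$

Since $\frac{1}{n} \|X^n\|^2 \to 1$ and $\frac{1}{n} \| A_n X^n + \sigma N^k \|^2 \to R(1 + \sigma^2)$ in $L^1$, which implies uniform integrability,
the last two terms on the right-hand side of (145) vanish as $\tau \to \infty$. This completes the proof of
$\zeta_L(X, R) \le R L_R^2$. □

Footnote (on the expansion (141)-(142) in the proof of Theorem 9): In fact $h(\gamma) + (1 - \gamma) H(P_d) + \gamma h(P_c)$ is the $\gamma$-dimensional entropy of (3) defined by Rényi [28, Equation (4) and
Theorem 3].

Footnote (on the use of Theorem 6 in the proof of Theorem 10): Since we assume that $\mathrm{var} X = 1$, the finite-entropy condition of Theorem 6 is satisfied automatically.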

Proof of Theorem 11. Achievability: We show that $\mathcal{R}_L(X) \le \overline{\mathcal{D}}(X)$. Fix $\delta > 0$ arbitrarily and let
$R = \overline{\mathcal{D}}(X) + 2\delta$. Set $s = R \sigma^{-2}$ and $\beta = \eta s$. Define

$$\begin{aligned}
u(\beta) &= \beta\, \mathrm{mmse}(X, \beta) - R \left( 1 - \frac{\beta}{s} \right), && (146) \\
f(\beta) &= I(X; \sqrt{\beta} X + N) - \frac{R}{2} \log \beta, && (147) \\
g(\beta) &= f(\beta) + \frac{R \beta}{2 s}, && (148)
\end{aligned}$$

which satisfy the following properties:

1. Since $\mathrm{mmse}(X, \cdot)$ is smooth on $(0, \infty)$ [70, Proposition 7], $u$, $f$ and $g$ are all smooth functions
on $(0, \infty)$. Additionally, since $\mathbb{E}[X^2] < \infty$, $u$ is also right-continuous at zero. In particular,
by the I-MMSE relationship [54],

$$f'(\beta) = \frac{\beta\, \mathrm{mmse}(X, \beta) - R}{2 \beta}. \qquad (149)$$

2. For all $0 \le \beta \le s$,

$$f(\beta) \le g(\beta) \le f(\beta) + \frac{R}{2}. \qquad (150)$$

3. Recalling the scaling law of mutual information in (11), we have

$$\limsup_{\beta \to \infty} \frac{f(\beta)}{\log \beta} = \frac{\bar{d}(X) - \overline{\mathcal{D}}(X) - 2\delta}{2} \le -\delta, \qquad (151)$$

where the last inequality follows from the sandwich bound between information dimension
and MMSE dimension in (18).

Let $\beta_s$ be the root of $u$ in $(0, s)$ which minimizes $g(\beta)$. Note that $\beta_s$ exists since $u(0) = -R < 0$,
$u(s) = s\, \mathrm{mmse}(X, s) > 0$ and $u$ is continuous on $[0, \infty)$. According to (73) in the replica symmetry
postulate,

$$D_L(X, R, \sigma^2) = \mathrm{mmse}(X, \eta_s s), \qquad (152)$$

where $\eta_s$ is the solution of (74) in $(0, 1)$ which minimizes (75), denoted by

$$E(\eta) = I(X; \sqrt{\eta s}\, X + N) + \frac{R}{2} (\eta - 1 - \log \eta). \qquad (153)$$

We claim that for any fixed $s$,

$$\eta_s = \frac{\beta_s}{s}. \qquad (154)$$

To see this, note that the solutions to (74) are precisely the roots of $u$ scaled by $\frac{1}{s}$. Moreover, since
$E(\eta) - g(\beta) = \frac{R}{2} (\log s - 1)$, for any set $A \subset (0, 1)$, we have

$$\arg\min_{\eta \in A} E(\eta) = \frac{1}{s} \arg\min_{\beta \in sA} g(\beta), \qquad (155)$$

resulting in (154). Next we focus on the behavior of $\beta_s$ as $s$ grows.
Proving the achievability of $R$ amounts to showing that

$$\limsup_{\sigma \to 0} \frac{D_L(X, R, \sigma^2)}{\sigma^2} < \infty, \qquad (156)$$

which, in view of (152) and (154), is equivalent to showing that $\beta_s$ grows at least linearly as $s \to \infty$,
i.e.,

$$\liminf_{s \to \infty} \frac{\beta_s}{s} > 0. \qquad (157)$$
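To make the roles of $u$, $\beta_s$ and $\eta_s$ concrete, the following sketch works out the standard Gaussian case, where $\mathrm{mmse}(X, \beta) = \frac{1}{1 + \beta}$ and the MMSE dimension equals 1. It is an illustration only, under the assumption that $u(\beta) = \beta\, \mathrm{mmse}(X, \beta) - R(1 - \beta / s)$ with $s = R / \sigma^2$; in this special case $u$ is increasing, so its root is unique and plain bisection suffices. The function names are hypothetical.

```python
import math

def mmse_gaussian(beta):
    """MMSE of a standard Gaussian input at SNR beta: 1 / (1 + beta)."""
    return 1.0 / (1.0 + beta)

def u(beta, R, s):
    """u(beta) = beta * mmse(X, beta) - R * (1 - beta / s), cf. (146)."""
    return beta * mmse_gaussian(beta) - R * (1.0 - beta / s)

def root_of_u(R, s, iters=200):
    """Bisection on [0, s]; u(0) < 0 < u(s) and u is increasing for Gaussian X."""
    lo, hi = 0.0, s
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if u(mid, R, s) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gaussian_noise_sensitivity(R, sigma2):
    """Return (eta_s, D_L / sigma^2) with eta_s = beta_s / s as in (154)."""
    s = R / sigma2
    beta_s = root_of_u(R, s)
    return beta_s / s, mmse_gaussian(beta_s) / sigma2

R = 1.5
for sigma2 in (1e-1, 1e-2, 1e-3, 1e-4):
    eta_s, sensitivity = gaussian_noise_sensitivity(R, sigma2)
    print(sigma2, eta_s, sensitivity)
```

As $\sigma^2$ decreases, $\eta_s$ stays bounded away from zero (it approaches $1 - 1/R$), so $\beta_s = \eta_s s$ grows linearly in $s$ and the ratio $D_L / \sigma^2$ stays bounded, here approaching $\frac{1}{R - 1}$.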

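By the definition of $\overline{\mathcal{D}}(X)$ and (151), there exists $B > 0$ such that for all $\beta > B$,

$$\beta\, \mathrm{mmse}(X, \beta) \le R - \delta \qquad (158)$$

and

$$f(\beta) \le -\frac{\delta}{2} \log \beta. \qquad (159)$$

In the sequel we focus on sufficiently large $s$. Specifically, we assume that

$$s > \frac{R}{\delta} \max\left\{ B,\ e^{-\frac{2}{\delta} \left( K - \frac{R}{2} \right)} \right\}, \qquad (160)$$

where $K = \min_{\beta \in [0, B]} g(\beta)$ is finite by the continuity of $g$.
Let

$$\beta_0 = \frac{\delta s}{R}. \qquad (161)$$

Then $\beta_0 > B$ by (160). By (158), $u(\beta_0) = \beta_0\, \mathrm{mmse}(X, \beta_0) - R + \delta \le 0$. Since $u(s) > 0$, by the
continuity of $u$ and the intermediate value theorem, there exists $\beta_0 \le \beta^* \le s$ such that $u(\beta^*) = 0$.
By (158),

$$f'(\beta) \le -\frac{\delta}{2 \beta} < 0, \quad \forall \beta > B. \qquad (162)$$

Hence $f$ strictly decreases on $(B, \infty)$. Denote the root of $u$ that minimizes $f(\beta)$ by $\beta'_s$, which must
lie beyond $\beta^*$. Consequently, we have

$$B < \frac{\delta s}{R} = \beta_0 \le \beta^* \le \beta'_s. \qquad (163)$$

Next we argue that $\beta_s$ cannot differ from $\beta'_s$ by more than a constant factor. In particular, we show that

$$\beta_s \ge e^{-\frac{R}{\delta}} \beta'_s, \qquad (164)$$

which, combined with (163), implies that

$$\frac{\beta_s}{s} \ge \frac{\delta}{R}\, e^{-\frac{R}{\delta}} \qquad (165)$$

for all $s$ that satisfy (160). This yields the desired (157). We now complete the proof by showing
(164). First, we show that $\beta_s > B$. This is because

$$\begin{aligned}
g(\beta_s) &\le g(\beta'_s) && (166) \\
&= f(\beta'_s) + \frac{R \beta'_s}{2 s} && (167) \\
&\le f(\beta_0) + \frac{R}{2} && (168) \\
&\le -\frac{\delta}{2} \log \frac{\delta s}{R} + \frac{R}{2} && (169) \\
&< K && (170) \\
&= \min_{\beta \in [0, B]} g(\beta), && (171)
\end{aligned}$$

where

• (166): by definition, $\beta_s$ and $\beta'_s$ are both roots of $u$ and $\beta_s$ minimizes $g$ among all roots;

• (168): by (163) and the fact that $f$ is strictly decreasing on $(B, \infty)$;

• (169): by (159) and (161);

• (171): by (160).

Now we prove (164) by contradiction. Suppose $\beta_s < e^{-\frac{R}{\delta}} \beta'_s$. Then

$$\begin{aligned}
g(\beta'_s) - g(\beta_s) &= \frac{R}{2 s} (\beta'_s - \beta_s) + f(\beta'_s) - f(\beta_s) && (172) \\
&\le \frac{R}{2} + \int_{\beta_s}^{\beta'_s} f'(b)\, \mathrm{d}b && (173) \\
&\le \frac{R}{2} - \frac{\delta}{2} \log \frac{\beta'_s}{\beta_s} && (174) \\
&< 0, && (175)
\end{aligned}$$

contradicting (166), where (174) is due to (162).

Converse: We show that $\mathcal{R}_L(X) \ge \underline{\mathcal{D}}(X)$. Recall that $\mathcal{R}_L(X)$ is the minimum rate that
guarantees that the reconstruction error $D_L(X, R, \sigma^2)$ vanishes according to $O(\sigma^2)$ as $\sigma^2 \to 0$. In
fact, we will show a stronger result: as long as $D_L(X, R, \sigma^2) = o(1)$ as $\sigma^2 \to 0$, we have $R \ge \underline{\mathcal{D}}(X)$.
By (152), $D_L(X, R, \sigma^2) = o(1)$ if and only if $\beta_s \to \infty$. Since $u(\beta_s) = 0$, we have

$$\begin{aligned}
R &\ge \limsup_{s \to \infty} R \left( 1 - \frac{\beta_s}{s} \right) && (176) \\
&= \limsup_{s \to \infty} \beta_s\, \mathrm{mmse}(X, \beta_s) && (177) \\
&\ge \liminf_{\beta \to \infty} \beta\, \mathrm{mmse}(X, \beta) && (178) \\
&= \underline{\mathcal{D}}(X). && (179)
\end{aligned}$$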
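Asymptotic noise sensitivity: Finally, we prove (79). Assume that $\mathcal{D}(X)$ exists, i.e., $\mathcal{D}(X) = d(X)$, in view of (18). By definition of $\mathcal{D}(X)$, we have

$$\mathrm{mmse}(X, \beta) = \frac{\mathcal{D}(X)}{\beta} + o\left( \frac{1}{\beta} \right), \quad \beta \to \infty. \qquad (180)$$

As we saw in the achievability proof, whenever $R > \mathcal{D}(X)$, (157) holds, i.e., $\eta_s = \Omega(1)$ as $s \to \infty$.
Therefore, as $s \to \infty$, we have

$$\frac{1}{\eta_s} = 1 + \frac{s}{R}\, \mathrm{mmse}(X, \eta_s s) = 1 + \frac{\mathcal{D}(X)}{\eta_s R} (1 + o(1)), \qquad (181)$$

i.e.,

$$\eta_s = 1 - \frac{\mathcal{D}(X)}{R} + o(1). \qquad (182)$$

By the replica symmetry postulate (73),

$$\begin{aligned}
D_L(X, R, \sigma^2) &= \mathrm{mmse}(X, \eta_s s) && (183) \\
&= \frac{1 - \eta_s}{\eta_s}\, \sigma^2 && (184) \\
&= \frac{\mathcal{D}(X)}{R - \mathcal{D}(X)}\, \sigma^2 (1 + o(1)). && (185)
\end{aligned}$$

□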

Remark 20. Note that $\beta_s$ is a subsequence parametrized by $s$, which may take only a restricted
subset of values. In fact, even if we impose the requirement that $D_L(X, R, \sigma^2) = O(\sigma^2)$, it is
still possible that the limit in (177) lies strictly between $\underline{\mathcal{D}}(X)$ and $\overline{\mathcal{D}}(X)$. For example, if $X$ is
Cantor distributed as defined in (13), then it can be shown that the limit in (177) approaches the
information dimension $d(X) = \log_3 2$.

Remark 21 (Multiple solutions in the replica symmetry postulate). Solutions to (74) in the replica
symmetry postulate and to the following equation in $\beta$

$$\beta\, \mathrm{mmse}(X, \beta) = R - \sigma^2 \beta \qquad (186)$$

differ only by a scale factor of $\frac{R}{\sigma^2}$. Next we give an explicit example where (186) can have arbitrarily
many solutions. Let $X$ be Cantor distributed as defined in (13). According to [29, Theorem 16],
$\beta \mapsto \beta\, \mathrm{mmse}(X, \beta)$ oscillates in $\log_3 \beta$ with period two, as shown in Fig. 8 in a linear-log plot.
Therefore, as $\sigma^2 \to 0$, the number of solutions to (186) grows unbounded according to $\Theta\big( \log \frac{1}{\sigma^2} \big)$.
In fact, in order for Theorem 11 to hold, it is crucial that the replica solution be given by the
solution that minimizes the free energy (75).



7 Concluding remarks 

In the compressed sensing literature it is common to guarantee that for any individual sparse 
input the matrix will likely lead to reconstruction, or, alternatively, that a single matrix will work 
for all possible signals. As opposed to this worst-case (Hamming) approach, in this paper we adopt 
a statistical (Shannon) framework for compressed sensing by modeling input signals as random 
processes rather than individual sequences. As customary in information theory, it is advisable to 
Figure 8: Multiple solutions to (186) in the replica symmetry postulate, with Cantor distributed 
X, R = 0.632 and = 3''^'^. 

initiate the study of fundamental limits assuming independent identically distributed information 
sources. Naturally, this entails substantial loss of practical relevance, so generalization to sources 
with memory is left for future work. The fundamental limits apply to the asymptotic regime of large 
signal dimension, although a number of the results in the noiseless case are in fact non-asymptotic 
(see, e.g., Theorems 4 and 5). 

We have investigated the phase transition thresholds (minimum measurement rate) of recon- 
struction error probability (noiseless observations) and normalized MMSE (noisy observations) 
achievable by optimal nonlinear, optimal linear, and random linear encoders combined with the 
corresponding optimal decoders (i.e. conditional mean estimates). For discrete-continuous mix- 
tures, which are the most relevant for compressed sensing applications, the optimal phase transition 
threshold is shown to be the information dimension of the input, i.e., the weight of the analog part, 
regardless of the specific discrete and absolutely continuous component. The universal optimality 
of random sensing matrices with non-Gaussian i.i.d. entries in terms of phase transition thresholds 
is still unknown. The phase-transition thresholds of popular decoding algorithms (e.g., LASSO 
or AMP decoders) turn out to be far from the optimal boundary. In a recent preprint [71], it 
is shown that using random sensing matrices constructed from spatially coupled error-correcting 
codes [72] and the corresponding AMP decoder, the information dimension can be achieved under 
mild conditions, which are optimal in view of the results in [12]. Designing deterministic sensing 
matrices that attain the optimal thresholds remains an outstanding challenge. 

In contrast to the Shannon theoretic limits of lossless and lossy compression of discrete sources, 
one of the lessons drawn from the results in this paper and [12] is that compressed sensing of every 
(memoryless) process taking values on finite or countably infinite alphabets can be accomplished at 
zero rate, as long as the observations are noiseless. In fact, we have even shown in Theorem 4 a non- 
asymptotic embodiment of this conclusion based on a probabilistic extension of the embeddability 
of fractal sets. In the case of noisy observations, the same insensitivity to the actual discrete 
signal distribution holds as far as the phase transition threshold is concerned. However, in the 
non-asymptotic regime (i.e. for given signal dimension and signal-to-noise-ratio) the optimum 
rate-distortion tradeoff will indeed depend on the signal distribution. 

In this paper we have assumed a Bayesian setup where the input is i.i.d. with common dis- 
tribution known to both the encoder and the decoder. In contrast, the minimax formulation in 
[13, 62, 73] assumes that the input distribution is a discrete-continuous mixture whose discrete 
component is known to be a point mass at zero, while the continuous component, i.e., the prior of 
the non-zero part, is unknown. Minimax analyses were carried out for LASSO and AMP algorithms 
[13], where the minimum and maximum are with respect to the parameter of the algorithm and 
the non-zero prior, respectively. The results in Section 5 demonstrate that the LASSO and AMP 
algorithms do not attain the fundamental limit achieved by the optimal decoder in the Bayesian 
setup. However, it is possible to improve performance if the input distribution is known to the 
reconstruction algorithm. For example, the message passing decoder in [71] that achieves the op- 
timal phase transition threshold is a variant of the AMP algorithm where the denoiser is replaced 
by the Bayesian estimator (conditional mean) of the input under additive Gaussian noise. See also 
[74, Section 6.2] about how to incorporate the prior information into the AMP algorithm. 

One of our main findings is Theorem 10 which shows that i.i.d. Gaussian sensing matri- 
ces achieve the same phase-transition threshold as optimal nonlinear encoding, for any discrete- 
continuous mixture. This result is universal in the sense that it holds for arbitrary noise distribu- 
tions with finite non-Gaussianness. Moreover, the fundamental limit depends on the input statistics 
only through the weight of the analog component, regardless of the specific discrete and continuous 
components. The argument used in the proof of Theorem 10 relies crucially on the Gaussianness 
of the sensing matrices because of two reasons: 

• The upper bound on the distribution function of the least singular value in Lemma 1 is a 
direct consequence of the upper bound on its density (due to Edelman [65]), which is only 
known in the Gaussian case. In fact, we only need that the exponent in (94) diverges as 
$t \to 0$. It is possible to generalize this result to other sub-Gaussian ensembles with densities 
by adapting the arguments in [66, Theorem 1.1]. However, it should be noted that in general 
Lemma 1 does not hold for discrete ensembles (e.g. Rademacher), because the least singular 
value always has a mass at zero with a fixed exponent; 

• Due to the rotational invariance of the Gaussian ensemble, the result in Lemma 2 does not 
depend on the basis of the subspace. 

Another contribution of this work is the rigorous proof of the phase transition thresholds for 
mixture distributions. Furthermore, based on the MMSE dimension results in [29], we have shown 
in Section 4.6 that these conclusions coincide with previous predictions put forth on the basis of 
replica-symmetry heuristics. 

One interesting direction is to investigate the optimal sensing matrix in a minimax sense. While 
our Theorem 10 shows that optimized sensing matrices (or even non-linear encoders) do not improve 
the phase transition threshold for Gaussian sensing matrices, it should be interesting to ascertain 
whether this conclusion carries over to the minimax setup, i.e., whether it is possible to lower the 
minimax phase transition threshold of the noise sensitivity achieved by Gaussian sensing matrices 
and LASSO or AMP reconstruction algorithms computed in [13] by optimizing the sensing matrix 
subject to the Frobenius-norm constraint in (41). 
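Appendix A Proof of the middle inequality in (46)

We show that for any fixed $\epsilon > 0$,

$$D_L^*(X, R, \sigma^2) \le D_L(X, R, (1 + \epsilon)^2 \sigma^2). \qquad (187)$$

By the continuity of $\sigma^{-2} \mapsto D_L(X, R, \sigma^2)$ proved in Theorem 7, $\sigma^2 \mapsto D_L(X, R, \sigma^2)$ is also continuous.
Therefore sending $\epsilon \downarrow 0$ in (187) yields the second inequality in (46). To show (187), recall
that $A$ consists of i.i.d. entries with zero mean and variance $\frac{1}{n}$. Since $k = nR$, $\frac{1}{k} \|A\|_F^2 \to 1$ in probability as
$n \to \infty$, by the weak law of large numbers. Therefore $\mathbb{P}\{A \in E_n\} \to 1$ where

$$E_n = \big\{ A : \|A\|_F^2 \le k (1 + \epsilon)^2 \big\}.$$

Therefore

$$\begin{aligned}
D_L(X, R, (1 + \epsilon)^2 \sigma^2)
&= \limsup_{n \to \infty} \frac{1}{n}\, \mathrm{mmse}\big( X^n \,\big|\, A X^n + (1 + \epsilon) \sigma N^k, A \big) && (188) \\
&= \limsup_{n \to \infty} \frac{1}{n}\, \mathrm{mmse}\Big( X^n \,\Big|\, \frac{A}{1 + \epsilon} X^n + \sigma N^k, A \Big) && (189) \\
&\ge \limsup_{n \to \infty} \frac{\mathbb{P}\{A \in E_n\}}{n}\, \mathbb{E}\Big[ \mathrm{mmse}\Big( X^n \,\Big|\, \frac{A}{1 + \epsilon} X^n + \sigma N^k, A \Big) \,\Big|\, A \in E_n \Big] && (190) \\
&\ge \limsup_{n \to \infty} \mathbb{P}\{A \in E_n\} \inf_{H : \|H\|_F^2 \le k} \frac{1}{n}\, \mathrm{mmse}\big( X^n \,\big|\, H X^n + \sigma N^k \big) && (191) \\
&= D_L^*(X, R, \sigma^2), && (192)
\end{aligned}$$

where (191) holds because $\frac{A}{1 + \epsilon}$ satisfies the power constraint for any $A \in E_n$.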


Appendix B Distortion-rate tradeoff of Gaussian inputs 

In this appendix we show the expressions (53)-(55) for the minimal distortion, thereby completing
the proof of Theorem 8.

B.1 Optimal encoder

Plugging the rate-distortion function of the standard Gaussian i.i.d. random process with
mean-square error distortion

$$R_{X_G}(D) = \frac{1}{2} \log^+ \frac{1}{D} \qquad (193)$$

into (40) yields the equality in (53). 
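The substitution can be spelled out in a few lines. This sketch assumes that (40) takes the form $R_X(D^*) = \frac{R}{2} \log(1 + \sigma^{-2})$, which is how it is used in (139); under that assumption the Gaussian distortion is $D^* = (1 + \sigma^{-2})^{-R}$. Function names are illustrative.

```python
import math

def rate_distortion_gaussian(D):
    """R_X(D) = 0.5 * log^+(1 / D) in nats, cf. (193)."""
    return max(0.0, 0.5 * math.log(1.0 / D))

def optimal_distortion_gaussian(R, sigma2):
    """Solve R_X(D) = (R / 2) * log(1 + 1 / sigma2) for D.

    Assumes (40) equates the rate-distortion function with R times the
    capacity of the Gaussian channel at noise variance sigma2.
    """
    return (1.0 + 1.0 / sigma2) ** (-R)

R, sigma2 = 0.7, 0.2
D_star = optimal_distortion_gaussian(R, sigma2)
# Both sides of the assumed form of (40) agree at D_star.
print(D_star, rate_distortion_gaussian(D_star), 0.5 * R * math.log(1.0 + 1.0 / sigma2))
```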



B.2 Optimal linear encoder 

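The minimal distortion $D_L^*(X, R, \sigma^2)$ achievable with the optimal linear encoder can be obtained
using the finite-dimensional results in [75, Equations (31)-(35)], which are obtained for Gaussian
input and noise of arbitrary covariance matrices. We include a proof for the sake of completeness.

Denote the sensing matrix by $H$. Since $X^n$ and $Y^k = H X^n + \sigma N^k$ are jointly Gaussian, the
conditional distribution of $X^n$ given $Y^k$ is $\mathcal{N}(\hat{X}^n, \Sigma_{X^n | Y^k})$, where

$$\begin{aligned}
\hat{X}^n &= H^T (H H^T + \sigma^2 I_k)^{-1} Y^k, && (194) \\
\Sigma_{X^n | Y^k} &= I_n - H^T (H H^T + \sigma^2 I_k)^{-1} H && (195) \\
&= (I_n + \sigma^{-2} H^T H)^{-1}, && (196)
\end{aligned}$$

where we used the matrix inversion lemma. Therefore, the optimal estimator is linear, given by
(194). Moreover,

$$\begin{aligned}
\mathrm{mmse}(X^n | Y^k) &= \mathrm{Tr}(\Sigma_{X^n | Y^k}) && (197) \\
&= \mathrm{Tr}\big( (I_n + \sigma^{-2} H^T H)^{-1} \big). && (198)
\end{aligned}$$

Choosing the best encoding matrix $H \in \mathbb{R}^{k \times n}$ boils down to the following optimization problem:

$$\begin{aligned}
\min_{H \in \mathbb{R}^{k \times n}} \quad & \mathrm{Tr}\big( (I_n + \sigma^{-2} H^T H)^{-1} \big) \\
\text{s.t.} \quad & \mathrm{Tr}(H^T H) \le k.
\end{aligned} \qquad (199)$$

Let $H^T H = U^T \Lambda U$, where $U$ is an $n \times n$ orthogonal matrix and $\Lambda$ is a diagonal matrix consisting
of the eigenvalues of $H^T H$, denoted by $\{\lambda_1, \ldots, \lambda_n\} \subset \mathbb{R}_+$. Then

$$\begin{aligned}
\mathrm{Tr}\big( (I_n + \sigma^{-2} H^T H)^{-1} \big) &= \sum_{i=1}^{n} \frac{1}{1 + \sigma^{-2} \lambda_i} && (200) \\
&\ge \frac{n}{1 + \sigma^{-2} \frac{\mathrm{Tr}(H^T H)}{n}} && (201) \\
&\ge \frac{n}{1 + R \sigma^{-2}}, && (202)
\end{aligned}$$

where (201) follows from the strict convexity of $x \mapsto \frac{1}{1 + \sigma^{-2} x}$ on $\mathbb{R}_+$ and $\mathrm{Tr}(H^T H) = \sum_{i=1}^{n} \lambda_i$, while
(202) is due to the power constraint and $R = \frac{k}{n}$. Hence

$$D_L^*(X_G, R, \sigma^2) \ge \frac{1}{1 + R \sigma^{-2}}. \qquad (203)$$

Next we consider two cases separately:

1. $R \ge 1$ ($k \ge n$): the lower bound in (203) can be achieved by

$$H = \begin{bmatrix} \sqrt{R}\, I_n \\ 0 \end{bmatrix}. \qquad (204)$$

2. $R < 1$ ($k < n$): the lower bound in (203) is not achievable. This is because to achieve equality
in (201), all $\lambda_i$ must be equal to $R$; however, $\mathrm{rank}(H^T H) \le \mathrm{rank}(H) \le k < n$ implies that at
least $n - k$ of them are zero. Therefore the lower bound (202) can be further improved to:

$$\begin{aligned}
\mathrm{Tr}\big( (I_n + \sigma^{-2} H^T H)^{-1} \big) &= n - k + \sum_{i=1}^{k} \frac{1}{1 + \sigma^{-2} \lambda_i} && (205) \\
&\ge n - k + \frac{k}{1 + \sigma^{-2} \frac{\mathrm{Tr}(H^T H)}{k}} && (206) \\
&\ge n - \frac{k}{1 + \sigma^2}. && (207)
\end{aligned}$$

Hence when $R < 1$,

$$D_L^*(X_G, R, \sigma^2) \ge 1 - \frac{R}{1 + \sigma^2}, \qquad (208)$$

which can be achieved by

$$H = [\, I_k \ \ 0 \,], \qquad (209)$$

that is, simply keeping the first $k$ coordinates of $X^n$ and discarding the rest.
Therefore the equality in (54) is proved.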
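The two cases can be checked numerically from the eigenvalues of $H^T H$ alone. The sketch below is an illustration under the assumption that the bounds read $\frac{1}{1 + R\sigma^{-2}}$ for $R \ge 1$ and $1 - \frac{R}{1 + \sigma^2}$ for $R < 1$; the comparison spectrum `eigs_other` is a hypothetical alternative with the same total power, not taken from the paper.

```python
def trace_error(eigs, sigma2):
    """Tr((I + H^T H / sigma2)^{-1}) from the eigenvalues of H^T H, cf. (200)."""
    return sum(1.0 / (1.0 + lam / sigma2) for lam in eigs)

def optimal_linear_distortion(R, sigma2):
    """Per-symbol distortion: (203) with equality for R >= 1, (208) for R < 1."""
    if R >= 1.0:
        return 1.0 / (1.0 + R / sigma2)
    return 1.0 - R / (1.0 + sigma2)

n, k, sigma2 = 10, 4, 0.5
# H = [I_k 0]: H^T H has k unit eigenvalues and n - k zeros, cf. (209).
eigs_truncate = [1.0] * k + [0.0] * (n - k)
# Hypothetical alternative with the same power Tr(H^T H) = k but rank 2.
eigs_other = [2.0, 2.0] + [0.0] * (n - 2)

d_truncate = trace_error(eigs_truncate, sigma2) / n
d_other = trace_error(eigs_other, sigma2) / n
print(d_truncate, optimal_linear_distortion(k / n, sigma2), d_other)
```

Keeping the first $k$ coordinates meets the bound exactly, while concentrating the same power on fewer directions gives a strictly larger distortion, in line with the convexity argument behind (206).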
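B.3 Random linear encoder

We compute the distortion $D_L(X_G, R, \sigma^2)$ achievable with random linear encoder $A$. Recall that
$A$ has i.i.d. entries with zero mean and variance $\frac{1}{n}$. By (198),

$$\begin{aligned}
\frac{1}{n}\, \mathrm{mmse}(X^n | A X^n + \sigma N^k, A) &= \frac{1}{n}\, \mathbb{E}\big[ \mathrm{Tr}\big( (I_n + \sigma^{-2} A^T A)^{-1} \big) \big] && (210) \\
&= \frac{1}{n}\, \mathbb{E}\left[ \sum_{i=1}^{n} \frac{1}{1 + \sigma^{-2} \lambda_i} \right], && (211)
\end{aligned}$$

where $\{\lambda_1, \ldots, \lambda_n\}$ are the eigenvalues of $A^T A$.

As $n \to \infty$, the empirical distribution of the eigenvalues of $\frac{1}{R} A^T A$ converges weakly to the
Marčenko-Pastur law almost surely [76, Theorem 2.35]:

$$\nu_R(\mathrm{d}x) = (1 - R)^+ \delta_0(\mathrm{d}x) + \frac{\sqrt{(x - a)^+ (b - x)^+}}{2 \pi c x}\, \mathrm{d}x, \qquad (212)$$

where

$$c = \frac{1}{R}, \quad a = (1 - \sqrt{c})^2, \quad b = (1 + \sqrt{c})^2. \qquad (213)$$

Since $\lambda \mapsto \frac{1}{1 + \sigma^{-2} \lambda}$ is continuous and bounded, applying the dominated convergence theorem to
(211) and integrating with respect to $\nu_R$ gives

$$\begin{aligned}
D_L(X_G, R, \sigma^2) &= \lim_{n \to \infty} \frac{1}{n}\, \mathrm{mmse}(X^n | A X^n + \sigma N^k, A) && (214) \\
&= \frac{1}{2} \left( 1 - R - \sigma^2 + \sqrt{(1 - R)^2 + 2 (1 + R) \sigma^2 + \sigma^4} \right), && (215)
\end{aligned}$$

where (215) follows from [76, (1.16)].

Next we verify that the formula (73) in the replica symmetry postulate which was based on
replica calculations coincides with (215) in the Gaussian case. Since in this case $\mathrm{mmse}(X_G, \mathsf{snr}) = \frac{1}{1 + \mathsf{snr}}$,
(74) becomes

$$\begin{aligned}
\frac{1}{\eta} &= 1 + \frac{1}{\sigma^2}\, \mathrm{mmse}(X, \eta R \sigma^{-2}) && (216) \\
&= 1 + \frac{1}{\sigma^2 + \eta R}, && (217)
\end{aligned}$$

whose unique positive solution is given by

$$\eta_\sigma = \frac{R - 1 - \sigma^2 + \sqrt{(1 - R)^2 + 2 (1 + R) \sigma^2 + \sigma^4}}{2 R}, \qquad (218)$$

which lies in $(0, 1)$. According to (73),

$$\begin{aligned}
D_L(X_G, R, \sigma^2) &= \mathrm{mmse}(X_G, \sigma^{-2} R\, \eta_\sigma) && (219) \\
&= \frac{1}{1 + \sigma^{-2} R\, \eta_\sigma} && (220) \\
&= \frac{2 \sigma^2}{R - 1 + \sigma^2 + \sqrt{(1 - R)^2 + 2 (1 + R) \sigma^2 + \sigma^4}}, && (221)
\end{aligned}$$

which can be verified, after straightforward algebra, to coincide with (215).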
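The "straightforward algebra" can be cross-checked numerically. The sketch below is an illustration, assuming the Gaussian fixed-point equation reads $\frac{1}{\eta} = 1 + \frac{1}{\sigma^2 + \eta R}$ and that the closed forms are as stated in (215), (218) and (221); function names are hypothetical.

```python
import math

def _root_term(R, sigma2):
    """sqrt((1 - R)^2 + 2 (1 + R) sigma^2 + sigma^4), shared by (215), (218), (221)."""
    return math.sqrt((1.0 - R) ** 2 + 2.0 * (1.0 + R) * sigma2 + sigma2 ** 2)

def d_random_linear(R, sigma2):
    """Closed form (215) obtained from the Marcenko-Pastur law."""
    return 0.5 * (1.0 - R - sigma2 + _root_term(R, sigma2))

def eta_sigma(R, sigma2):
    """Positive root (218) of the assumed Gaussian fixed-point equation."""
    return (R - 1.0 - sigma2 + _root_term(R, sigma2)) / (2.0 * R)

def d_replica(R, sigma2):
    """(219)-(220): MMSE of a standard Gaussian at SNR R * eta / sigma^2."""
    return 1.0 / (1.0 + R * eta_sigma(R, sigma2) / sigma2)

def fixed_point_residual(R, sigma2):
    """Residual of 1/eta = 1 + 1/(sigma^2 + eta R) at eta = eta_sigma."""
    eta = eta_sigma(R, sigma2)
    return 1.0 / eta - 1.0 - 1.0 / (sigma2 + eta * R)

for R, sigma2 in [(0.5, 0.1), (1.0, 1.0), (2.0, 1.0), (3.0, 0.01)]:
    print(R, sigma2, d_random_linear(R, sigma2), d_replica(R, sigma2))
```

The two printed distortions agree for every pair, which reflects the identity $\big(\sqrt{\cdot} - (R - 1 + \sigma^2)\big)\big(\sqrt{\cdot} + (R - 1 + \sigma^2)\big) = 4\sigma^2$ that links (215) and (221).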
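Appendix C LASSO noise sensitivity for fixed input distributions

Based on the results in [59], in this appendix we show that the asymptotic noise sensitivity of
LASSO is given by (92). Let $R = \frac{k}{n}$, and let $A$ denote a $k \times n$ random matrix with i.i.d. entries
distributed according to $\mathcal{N}(0, \frac{1}{n})$. Then $R^{-\frac{1}{2}} A$ has $\mathcal{N}(0, \frac{1}{k})$ entries, to which the result in [59]
applies. Let $g(y, A; \lambda)$ denote the LASSO procedure with penalization parameter $\lambda$ defined in (88),
which satisfies the following scaling-invariant property

$$g(t y, t A; t \lambda) = g(y, A; \lambda) \qquad (222)$$

for any $t > 0$. By [59, Corollary 1.6] and (222), the MSE achieved by the LASSO decoder is given
by

$$\begin{aligned}
&\lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\big[ \| X^n - g(A X^n + \sigma N^k, A; \lambda) \|^2 \big] && (223) \\
&\quad = \lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\big[ \| X^n - g(R^{-\frac{1}{2}} A X^n + R^{-\frac{1}{2}} \sigma N^k, R^{-\frac{1}{2}} A; \lambda R^{-\frac{1}{2}}) \|^2 \big] && (224) \\
&\quad = R \tau_*^2 - \sigma^2, && (225)
\end{aligned}$$

with $\tau_*^2$ being the unique solution to the following equation in $\tau^2$:

$$R \tau^2 = \sigma^2 + \mathbb{E}\big[ (\eta(X + \tau N; \alpha \tau) - X)^2 \big], \qquad (226)$$

where $\eta(\cdot\,; \cdot) : \mathbb{R} \times \mathbb{R}_+ \to \mathbb{R}$ is the soft-thresholding estimator

$$\eta(x; \theta) = (x - \theta)\, 1_{\{x > \theta\}} + (x + \theta)\, 1_{\{x < -\theta\}} \qquad (227)$$

and $\alpha = \alpha(\lambda R^{-\frac{1}{2}})$ with $\alpha(\cdot)$ being the strictly increasing function defined in [59, p. 1999]. Therefore
optimizing $D^{(\lambda)}(X, R, \sigma^2)$ over $\lambda$ is equivalent to optimizing over $\alpha$.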

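Next we assume that $X$ is distributed according to the mixture (90), where $Q$ is an arbitrary
probability measure such that $Q(\{0\}) = 0$. We analyze the weak-noise behavior of $D^{(\lambda)}(X, R, \sigma^2)$
when $R > R_{\pm}(\gamma)$ defined in (84). We show that for fixed $\alpha > 0$,

$$\mathbb{E}\big[ (\eta(X + \tau N; \alpha \tau) - X)^2 \big] = \Big( \gamma (1 + \alpha^2) + 2 (1 - \gamma) \big( (1 + \alpha^2) \Phi(-\alpha) - \alpha \varphi(\alpha) \big) \Big) \tau^2 (1 + o(1)) \qquad (228)$$

as $\tau \to 0$. Assembling (84), (225), (226) and (228), we obtain the formula for the asymptotic noise
sensitivity of optimized LASSO:

$$\zeta_{\mathrm{LASSO}}(X, R) = \inf_{\lambda} \lim_{\sigma^2 \to 0} \frac{D^{(\lambda)}(X, R, \sigma^2)}{\sigma^2} = \frac{R_{\pm}(\gamma)}{R - R_{\pm}(\gamma)}, \qquad (229)$$

which holds for any $Q$ with no mass at zero.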
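For a feel of the numbers, the following sketch evaluates the coefficient of $\tau^2$ in (228) and minimizes it over $\alpha$ on a grid. It assumes that this minimum plays the role of the threshold $R_{\pm}(\gamma)$ in (84), which is not restated here, and that the noise sensitivity is $\frac{R_{\pm}(\gamma)}{R - R_{\pm}(\gamma)}$ above the threshold; the function names and grid parameters are illustrative.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mse_coefficient(alpha, gamma):
    """Coefficient of tau^2 in (228) for threshold alpha and analog weight gamma."""
    tail = (1.0 + alpha ** 2) * Phi(-alpha) - alpha * phi(alpha)
    return gamma * (1.0 + alpha ** 2) + 2.0 * (1.0 - gamma) * tail

def lasso_threshold(gamma, grid=20000, alpha_max=10.0):
    """Grid-search minimum over alpha in [0, alpha_max]; assumed to equal R_pm(gamma)."""
    return min(mse_coefficient(alpha_max * i / grid, gamma) for i in range(grid + 1))

def lasso_noise_sensitivity(gamma, R):
    """R_pm / (R - R_pm) above the threshold, infinite at or below it."""
    r = lasso_threshold(gamma)
    return r / (R - r) if R > r else math.inf

gamma = 0.1
print(lasso_threshold(gamma), lasso_noise_sensitivity(gamma, 0.5))
```

The threshold always exceeds $\gamma$ because both terms of the coefficient are nonnegative and the first is at least $\gamma$, which is consistent with the gap between LASSO and the optimal threshold discussed in the concluding remarks.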

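We now complete the proof of (229) by establishing (228). Let $X' \sim Q$. By (90),

$$\begin{aligned}
\mathbb{E}\big[ (\eta(X + \tau N; \alpha \tau) - X)^2 \big] &= (1 - \gamma)\, \mathbb{E}\big[ \eta^2(\tau N; \alpha \tau) \big] && (230) \\
&\quad + \gamma\, \mathbb{E}\big[ (\eta(X' + \tau N; \alpha \tau) - X')^2 \big], && (231)
\end{aligned}$$

where

$$\begin{aligned}
\mathbb{E}\big[ \eta^2(\tau N; \alpha \tau) \big] &= 2 \tau^2\, \mathbb{E}\big[ (N - \alpha)^2\, 1_{\{N > \alpha\}} \big] && (232) \\
&= 2 \tau^2 \big( (1 + \alpha^2) \Phi(-\alpha) - \alpha \varphi(\alpha) \big) && (233)
\end{aligned}$$

and

$$\begin{aligned}
\mathbb{E}\big[ (\eta(X' + \tau N; \alpha \tau) - X')^2 \big] &= \tau^2 \Big( \mathbb{E}\big[ (N - \alpha)^2\, 1_{\{X' + \tau N > \alpha \tau\}} \big] + \mathbb{E}\big[ (N + \alpha)^2\, 1_{\{X' + \tau N < -\alpha \tau\}} \big] \Big) && (234) \\
&\quad + \mathbb{E}\big[ X'^2\, 1_{\{ |X' + \tau N| \le \alpha \tau \}} \big]. && (235)
\end{aligned}$$

Since $1_{\{X' + \tau N > \alpha \tau\}} \to 1_{\{X' > 0\}}$, $1_{\{X' + \tau N < -\alpha \tau\}} \to 1_{\{X' < 0\}}$ and $\mathbb{P}\{X' = 0\} = 0$, applying the
bounded convergence theorem to the right-hand side of (234) yields $\tau^2 (1 + \alpha^2)(1 + o(1))$. It remains
to show that the term in (235) is $o(\tau^2)$. Indeed, as $\tau \to 0$,

$$\begin{aligned}
\tau^{-2}\, \mathbb{E}\big[ X'^2\, 1_{\{ |X' + \tau N| \le \alpha \tau \}} \big]
&= \tau^{-2}\, \mathbb{E}\left[ X'^2 \left( \Phi\Big( -\frac{X'}{\tau} + \alpha \Big) - \Phi\Big( -\frac{X'}{\tau} - \alpha \Big) \right) \right] && (236) \\
&\le 2 \alpha\, \tau^{-2}\, \mathbb{E}\left[ X'^2\, \varphi\Big( -\frac{|X'|}{\tau} + \alpha \Big) \right] && (237) \\
&= o(1), && (238)
\end{aligned}$$

where we have applied the bounded convergence theorem since

$$\frac{X'^2}{\tau^2}\, \varphi\Big( -\frac{|X'|}{\tau} + \alpha \Big) \le \max_{t \ge 0}\, t^2 \varphi(-t + \alpha) = \frac{(\alpha + \sqrt{8 + \alpha^2})^2}{4}\, \varphi\left( \frac{\alpha - \sqrt{8 + \alpha^2}}{2} \right) \qquad (239)$$

and $\tau^{-2} X'^2 \varphi\big( -\frac{|X'|}{\tau} + \alpha \big) \to 0$ as $\tau \to 0$ because $\mathbb{P}\{ |X'| > 0 \} = 1$, completing the proof of (229)
if $R > R_{\pm}(\gamma)$. In the case $R \le R_{\pm}(\gamma)$, the same reasoning yields that $\liminf_{\sigma^2 \to 0} D^{(\lambda)}(X, R, \sigma^2) > 0$
for any choice of $\lambda$.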